Abstract

We explore the theoretical and numerical properties of a fully Bayesian model selection method in sparse ultrahigh-dimensional settings, i.e., $p \gg n$, where $p$ is the number of covariates and $n$ is the sample size. Our method consists of (1) a hierarchical Bayesian model with a novel prior placed over the model space, which includes a hyperparameter $t_n$ controlling the model size, and (2) an efficient MCMC algorithm for automatic and stochastic search of the models. Our theory shows that, when $t_n$ is correctly specified, the proposed method yields selection consistency, i.e., the posterior probability of the true model asymptotically approaches one; when $t_n$ is misspecified, the selected model is still asymptotically nested in the true model. The theory also reveals insensitivity of the selection result with respect to the choice of $t_n$. In implementations, a reasonable prior is further assumed on $t_n$, which allows us to draw its samples stochastically. Our approach conducts selection, estimation and even inference in a unified framework. No additional prescreening or dimension reduction step is needed. Two novel g-priors are proposed to make our approach more flexible. A simulation study is given to display the numerical advantage of our method.

Keywords and phrases: model selection, fully Bayesian method, ultrahigh-dimensionality, posterior consistency, size-control prior on model space, generalized Zellner-Siow prior, generalized hyper-g prior, constrained blockwise Gibbs sampler, simultaneous credible interval.
1 Introduction



Suppose the $n$-dimensional response vector $Y = (y_1, \ldots, y_n)^T$ and the $n$ by $p$ covariate matrix $X = (X_1, \ldots, X_p)$ are linked by the linear model

$$Y = X\beta + \epsilon, \tag{1.1}$$

where the $X_j$s, $j = 1, \ldots, p$, are $n$-vectors, $\beta = (\beta_1, \ldots, \beta_p)^T$ is an unknown $p$-vector of regression coefficients, and $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$ is an $n$-vector of random errors. The true parameter vector $\beta$ contains $s_n$ nonzero components and $p - s_n$ zeros. Here we assume $p \gg n$, i.e., $p/n \to \infty$ as $n \to \infty$, but ideally restrict $s_n = o(n)$, i.e., the true model is sparse. Our goal is to explore an automatic fully Bayesian procedure for selecting and estimating the nonzero $\beta_j$s in (1.1) in the "large-$p$-small-$n$" scenario.
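To make the setting of (1.1) concrete, the following minimal sketch generates data from a sparse linear model with many more covariates than observations. The function name, the standard normal design, the equal signal strength and the specific sizes are illustrative assumptions, not choices made in this paper.

```python
import random

def simulate_sparse_linear_model(n, p, s, signal=1.0, noise_sd=1.0, seed=0):
    """Draw (X, Y, beta) from Y = X beta + eps with s nonzero coefficients.

    Hypothetical helper: the design and signal values are assumptions.
    """
    rng = random.Random(seed)
    # True coefficient vector: the first s entries are nonzero, the rest zero.
    beta = [signal if j < s else 0.0 for j in range(p)]
    # Covariates are i.i.d. standard normal (an illustrative choice only).
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    # Only the s active columns contribute to the mean of Y.
    Y = [sum(X[i][j] * beta[j] for j in range(s)) + rng.gauss(0.0, noise_sd)
         for i in range(n)]
    return X, Y, beta

# A "large-p-small-n" example: 20 observations, 200 covariates, 3 signals.
X, Y, beta = simulate_sparse_linear_model(n=20, p=200, s=3)
print(len(X), len(X[0]), sum(b != 0 for b in beta))
```

Here the number of covariates is ten times the sample size while only three coefficients are nonzero, which is the sparse regime the paper targets.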

In frequentist settings, there is a vast amount of literature about variable selection in sparse ultrahigh- 
dimensional models. We only list a few representative ones. Based on LASSO, tB^l 132"! |45ll5Tl 1331 obtained 
selection consistency when $p$ is growing exponentially with $n$, i.e., $\log p = O(n^a)$ for some $a > 0$. Selection
consistency here means, as n goes to infinity, with probability approaching one the selected model is the true 
model. [24] considered bridge regression, a link between the LASSO and ridge regression, and obtained selection consistency. [28] proposed a unified approach based on regularized least squares with a class of concave penalties. fffl [16] proposed sure independence screening (SIS) based on correlations. [46] proved selection consistency using BIC criteria. [49] examined several multi-stage selection approaches. [43] applied a regularized likelihood approach based on nonconvex constraints and proved selection and estimation consistency. [6] proposed a new method for variable selection without using a penalty. There are many other frequentist approaches in this research area; see [15] for an insightful review.

In the Bayesian framework, selection consistency is somewhat different from that in the frequentist setting. Unlike the frequentist setting, which treats the true model as fixed a priori, Bayesian approaches treat the model as a random element which has $2^p$ possible choices. Under proper Bayesian hierarchical models, it is possible to derive the posterior distribution of the model. In other words, the posterior probabilities of all the $2^p$ models are available. We say such a procedure is posterior consistent if the posterior probability of the true model converges to one. A nice property of the Bayesian approach is that it can evaluate all the possible models based on the posterior probabilities and provide a stochastic search, though an MCMC procedure might be needed. Besides, it can simultaneously conduct estimation and inference over the selected coefficients through the posterior samples.

Posterior consistency has been theoretically established when $p$ is fixed (see [17], [34], [29], [9]). [29] obtained posterior consistency in the setting of mixture of g-priors for fixed $p$. [39] extended these results to the growing $p$ situations. Their results cover both $p < n$ and $p \gg n$. For $p \gg n$, they examined a two-step procedure. Explicitly, in step I a dimension reduction (or prescreening) procedure such as SIS proposed by fBll is performed to obtain a reduced model space, and in step II the Bayesian selection procedure is performed over the reduced model space. However, the two-step scheme has several drawbacks. According to [14], to yield better selection accuracy, the data have to be divided into two subsamples, one for SIS and the other for Bayesian model selection. This additional prescreening step introduces additional complexity in applications, and very often one has to determine the sizes of both subsamples, though a default choice may be an equal split. Furthermore, in many high-dimensional problems, the number of predictors $p$ can be much larger than the sample size $n$, so the sizes of both subsamples become even smaller. Usual Bayesian selection procedures based on a smaller subset of the data may cause selection inaccuracy. Motivated by these considerations, an automatic one-step Bayesian method, which does not involve any prescreening or dimension reduction procedure, is highly needed and useful in both theoretical and applied aspects. Related theoretical results on Bayesian model selection include [2], [35], [22], who proved consistency of Bayes factors when $p = O(n)$. [26] placed a set of novel non-local priors over the model coefficients and proved posterior consistency for $p < n$. Recently, [3] proposed a non-fully Bayesian selection method which works under $p \gg n$ but requires thresholding the marginal posterior means of $\beta$.

In this paper, we explore the theoretical and numerical properties of a fully Bayesian model selection procedure in sparse ultrahigh-dimensional situations where $p$ is allowed to grow exponentially with $n$. In our approach, stochastic model search, parameter estimation and even inference can be simultaneously conducted in a unified framework, though an MCMC procedure is employed for these goals. No additional steps such as dimension reduction or thresholding are needed. Our model includes a hyperparameter controlling the size of the target models, namely, the size-control parameter. A set of mild sufficient conditions is provided under which posterior consistency holds when this size-control parameter is correctly specified, i.e., it is no smaller than the size of the true model. We also examine the selection performance when the size-control parameter is misspecified. To the best of our knowledge, our work is the first one establishing posterior consistency of the fully Bayesian model selection method in ultrahigh-dimensional settings, and theoretically examining the effect of a misspecified size-control parameter on the model selection result. To make the model more flexible, we propose two new types of g-priors extending those in [50], [29] to ultrahigh-dimensional settings. Posterior consistency under these priors is established. A prior over the size-control parameter is considered which largely avoids misspecification, and induces a nontrivial extension of the traditional sampling scheme. The simulation study reveals that the proposed method is computationally accurate and convenient.

The rest of this paper is organized as follows. In Section 2 a Bayesian hierarchical model involving suitable priors is explicitly given. Section 3 contains the theoretical results which justify posterior consistency and evaluate the effect of misspecifying the hyperparameter controlling the model size in various situations including the g-prior. New types of g-priors are constructed in this section. We also briefly discuss the credible interval construction over the selected coefficients. Section 4 presents the computational details involving a constrained blockwise sampling procedure. In Section 5 a simulation study is given to demonstrate the performance. All the technical proofs are given in the appendix.
2 A hierarchical model with a size-control prior on model space



Before formally describing our models, we first introduce some notation that is used frequently throughout this paper. Define $\gamma_j = I(\beta_j \neq 0)$, i.e., the 0-1 variable indicating the exclusion or inclusion of $\beta_j$, and define $\gamma = (\gamma_1, \ldots, \gamma_p)^T$. Throughout we use $|\gamma|$ to denote the number of ones in $\gamma$. Clearly, each $\gamma$ corresponds to a candidate model $Y = X_\gamma \beta_\gamma + \epsilon$, where $X_\gamma$ is an $n \times |\gamma|$ submatrix of $X$, and $\beta_\gamma$ is the subvector (with size $|\gamma|$) of $\beta$, whose columns and elements are indexed by the nonzero components of $\gamma$, respectively. The $2^p$ possible $\gamma$s correspond to the $2^p$ different models, which form the entire model space. For any $\gamma$ and $\gamma'$, let $(\gamma \setminus \gamma')_j = I(\gamma_j = 1, \gamma'_j = 0)$ and $(\gamma \cap \gamma')_j = I(\gamma_j = 1, \gamma'_j = 1)$. Thus, $\gamma \setminus \gamma'$ is the 0-1 vector indicating the variables present in $\gamma$ but absent in $\gamma'$, and $\gamma \cap \gamma'$ is the 0-1 vector indicating the variables present in both $\gamma$ and $\gamma'$. We say that $\gamma$ is nested in $\gamma'$ (denoted by $\gamma \subset \gamma'$) if $\gamma \setminus \gamma'$ is zero. Denote the true model coefficient vector by $\beta^0$ and the corresponding 0-1 vector by $\gamma^0$, and let $s_n = |\gamma^0|$ denote the size of the true model.

We adopt a normal linear model between the response and covariates, i.e.,

$$Y \mid \beta, \sigma^2 \sim N(X\beta, \sigma^2 I_n). \tag{2.1}$$

Suitable prior distributions are required for the parameters $\beta$ and $\sigma^2$. We adopt the "spike-and-slab" prior for the $\beta_j$s, i.e.,

$$\beta_j \mid \gamma_j, \sigma^2 \sim (1 - \gamma_j)\delta_0 + \gamma_j N(0, c_j \sigma^2), \tag{2.2}$$

where $\delta_0(\cdot)$ is the point mass measure concentrating on zero, and the $c_j$s are temporarily assumed to be fixed. Note that the $c_j$s are used to control the variance of the nonzero coefficients, and therefore are called the variance-control parameters. In later sections we will treat the mixture of g-prior setup, i.e., assuming priors on the $c_j$s. The "spike-and-slab" prior has been explored in various applied aspects by [44], ITTl, [10], [48], [3TTl.

We place an inverse $\chi^2$ prior on $\sigma^2$, i.e.,

$$1/\sigma^2 \sim \chi^2_\nu, \tag{2.3}$$

where $\nu$ is a fixed hyperparameter. Other choices such as the noninformative priors or inverse Gamma priors can also be applied. The theoretical results derived in this paper can be extended without further difficulty to these situations.
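As an illustration of the priors (2.2) and (2.3), the sketch below draws $\sigma^2$ and then $\beta$ for a given inclusion vector. The function names and the numerical values of $\nu$ and $c_j$ are hypothetical; the only facts used are that a $\chi^2_\nu$ variable is a Gamma variable with shape $\nu/2$ and scale 2, and that excluded coefficients are exactly zero.

```python
import math
import random

def draw_sigma2(nu, rng):
    """Draw sigma^2 with 1/sigma^2 ~ chi-square(nu), as in (2.3)."""
    # chi-square(nu) is Gamma(shape=nu/2, scale=2).
    precision = rng.gammavariate(nu / 2.0, 2.0)
    return 1.0 / precision

def draw_beta(gamma, c, sigma2, rng):
    """Draw beta from the spike-and-slab prior (2.2) given gamma and sigma^2."""
    beta = []
    for gj, cj in zip(gamma, c):
        if gj == 1:
            # Slab: N(0, c_j * sigma^2).
            beta.append(rng.gauss(0.0, math.sqrt(cj * sigma2)))
        else:
            # Spike: point mass at zero.
            beta.append(0.0)
    return beta

rng = random.Random(0)
gamma = [1, 0, 0, 1, 0]      # illustrative inclusion pattern
c = [10.0] * 5               # illustrative variance-control parameters
sigma2 = draw_sigma2(nu=3, rng=rng)
print(sigma2, draw_beta(gamma, c, sigma2, rng))
```

Larger values of $c_j$ spread the slab and allow larger nonzero coefficients, which is why the $c_j$s are referred to as variance-control parameters.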

A prior probability, namely $p(\gamma)$, should be assigned to each candidate model $\gamma$, i.e.,

$$\gamma \sim p(\gamma). \tag{2.4}$$

A popular choice of $p(\gamma)$ is the so-called independent Bernoulli prior used by ETJl, |2T1, [19], [4], HI, [36], [30], [40], or the Bernoulli-Beta prior used by 1071, [3], |3TI. The independent Bernoulli prior assumes each covariate to be included in the model with probability $\theta_j$, i.e., $p(\gamma) = \prod_{j=1}^{p} \theta_j^{\gamma_j}(1 - \theta_j)^{1 - \gamma_j}$ with the $\theta_j$s being fixed. The Bernoulli-Beta prior further assumes a Beta prior over the $\theta_j$s.
In many practical applications, such as genewise selection, only a small number of covariates should be included in the model, which can be treated as, in Bayesian terminology, prior information. Thus, most of the candidate models, especially those with large model sizes, should be assigned a tiny or even zero prior probability. In the Bernoulli prior, this can be achieved by assuming a very small but positive $\theta_j$. Due to the huge number of candidate models, of which most are "incorrect", even though each "incorrect" model is assigned a very small prior probability, the aggregated prior probability over all the "incorrect" models can still be large. This will severely affect the accuracy of the Bayesian model selection procedure when $p \gg n$. Here we propose a novel prior that only assigns positive weights to the models with smaller sizes, i.e., a size-control prior on model space. Namely,

$$p(\gamma) = \begin{cases} \pi_\gamma, & \text{if } |\gamma| \le t_n, \\ 0, & \text{otherwise}, \end{cases} \tag{2.5}$$

where the $\pi_\gamma$ for $|\gamma| \le t_n$ are fixed positive numbers, and $t_n \in (0, n)$ is an integer-valued hyperparameter controlling the sizes of the candidate models. Clearly, (2.5) is more powerful than the Bernoulli or Bernoulli-Beta prior in screening out the models with larger sizes. When the number of nonzeros in $\beta^0$, i.e., $s_n$, is small so that $t_n \ge s_n$, this implies that (2.5) is powerful in screening out the "incorrect" models with greater sizes.
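The contrast between the two priors can be sketched numerically. The code below evaluates the size-control prior (2.5) under the assumption of equal weights $\pi_\gamma$ on all models with $|\gamma| \le t_n$, and computes how much prior mass an independent Bernoulli prior with a common inclusion probability places on models larger than $t_n$. Both function names and the numbers in the example are hypothetical.

```python
import math

def log_size_control_prior(gamma, t_n):
    """Log of the size-control prior (2.5), assuming equal weights pi_gamma."""
    p = len(gamma)
    size = sum(gamma)
    if size > t_n:
        return float("-inf")  # models larger than t_n get zero prior mass
    # Number of models in the target space: sum_{k=0}^{t_n} C(p, k).
    n_models = sum(math.comb(p, k) for k in range(t_n + 1))
    return -math.log(n_models)

def bernoulli_mass_above(p, theta, t_n):
    """Mass an independent Bernoulli(theta) prior puts on models with |gamma| > t_n."""
    below = sum(math.comb(p, k) * theta**k * (1 - theta)**(p - k)
                for k in range(t_n + 1))
    return 1.0 - below

# Illustrative numbers: p = 1000 covariates, inclusion probability 0.01, t_n = 5.
print(log_size_control_prior([1, 0, 0, 0], t_n=1))
print(bernoulli_mass_above(p=1000, theta=0.01, t_n=5))
```

Even with a small inclusion probability, the Bernoulli prior can leave most of its mass on models larger than $t_n$, whereas (2.5) assigns them probability zero by construction. The direct binomial sum is only meant for moderate $p$ and small $t_n$.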

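Based on the above Bayesian hierarchical model (2.1)-(2.5), the joint posterior distribution for $(\beta, \gamma, \sigma^2)$ can be derived. For simplicity, denote by $Z = (Y, X)$ the full data variable. The joint posterior distribution is then

$$p(\beta, \gamma, \sigma^2 \mid Z) \propto p(Z \mid \beta, \sigma^2)\, p(\beta \mid \sigma^2, \gamma)\, p(\gamma)\, p(\sigma^2)$$

$$\propto \sigma^{-n} \exp\left(-\frac{\|Y - X\beta\|^2}{2\sigma^2}\right) \prod_{j \in \gamma} \frac{1}{\sqrt{c_j}\,\sigma}\, \phi\left(\frac{\beta_j}{\sqrt{c_j}\,\sigma}\right) \prod_{j \in -\gamma} \delta_0(\beta_j)\; p(\gamma)\, p(\sigma^2), \tag{2.6}$$

where $\phi(\cdot)$ is the density function of the standard normal random variable, $j \in \gamma$ means the index $j \in \{1, \ldots, p\}$ satisfies $\gamma_j = 1$, $j \in -\gamma$ means $\gamma_j = 0$, and $p(\gamma)$ is the prior defined in (2.5). Integrating out $\beta$ and $\sigma^2$ in (2.6) one obtains

$$p(\gamma \mid Z) \propto \det(W_\gamma)^{-1/2}\, p(\gamma) \left(1 + Y^T (I_n - X_\gamma U_\gamma^{-1} X_\gamma^T) Y\right)^{-(n+\nu)/2}, \tag{2.7}$$

where $W_\gamma = \Sigma_\gamma^{1/2} U_\gamma \Sigma_\gamma^{1/2}$, $U_\gamma = \Sigma_\gamma^{-1} + X_\gamma^T X_\gamma$, and $\Sigma_\gamma$ denotes the principal submatrix of $\Sigma = \mathrm{diag}(c_1, \ldots, c_p)$ indexed by $\gamma$. Here we adopt the convention that $X_\emptyset = 0$ and $\Sigma_\emptyset = U_\emptyset = W_\emptyset = 1$, where $\emptyset$ means the null model, i.e., the vector $\gamma$ with all elements being zero.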
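The marginal posterior (2.7) can be evaluated on the log scale for a single model. The sketch below is one possible pure-Python implementation, not the authors' code: it uses a Cholesky factor of $U_\gamma$, the identity $\log\det W_\gamma = \log\det\Sigma_\gamma + \log\det U_\gamma$, and $Y^T X_\gamma U_\gamma^{-1} X_\gamma^T Y = \|L^{-1} X_\gamma^T Y\|^2$ when $U_\gamma = LL^T$. The function names and the `log_prior` argument are assumptions made for illustration.

```python
import math

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    k = len(A)
    L = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][m] * L[j][m] for m in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def log_posterior_unnormalized(X, Y, gamma, c, nu, log_prior=0.0):
    """Log of the right-hand side of (2.7), up to an additive constant."""
    n = len(Y)
    idx = [j for j, g in enumerate(gamma) if g == 1]
    yty = sum(y * y for y in Y)
    if not idx:
        # Null model convention: W = 1 and X_gamma = 0.
        return log_prior - 0.5 * (n + nu) * math.log1p(yty)
    # U = Sigma_gamma^{-1} + X_gamma' X_gamma and xty = X_gamma' Y.
    U = [[sum(X[i][r] * X[i][q] for i in range(n)) + (1.0 / c[r] if r == q else 0.0)
          for q in idx] for r in idx]
    xty = [sum(X[i][r] * Y[i] for i in range(n)) for r in idx]
    L = cholesky(U)
    # Forward substitution: z = L^{-1} xty, so xty' U^{-1} xty = ||z||^2.
    z = []
    for i in range(len(idx)):
        z.append((xty[i] - sum(L[i][m] * z[m] for m in range(i))) / L[i][i])
    quad = yty - sum(v * v for v in z)
    # log det W = log det Sigma_gamma + log det U.
    log_det_W = (sum(math.log(c[r]) for r in idx)
                 + 2.0 * sum(math.log(L[i][i]) for i in range(len(idx))))
    return log_prior - 0.5 * log_det_W - 0.5 * (n + nu) * math.log1p(quad)
```

Because only $|\gamma| \times |\gamma|$ matrices are factorized, the cost per model depends on the model size rather than on $p$, which is what makes the restriction $|\gamma| \le t_n$ computationally attractive.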

The optimal model $\hat{\gamma}$ is chosen to maximize (2.7), i.e.,

$$\hat{\gamma} = \arg\max_{\gamma} p(\gamma \mid Z). \tag{2.8}$$

In other words, $\hat{\gamma}$ achieves the highest posterior probability among all the possible models. When $|\gamma| > t_n$, $p(\gamma \mid Z) = 0$. So the maximization in (2.8) is actually performed over a smaller model space, named the target model space. We name the model selection procedure (2.8) Bayesian ultrahigh-dimensional screening. Ideally we hope to show that the selected model $\hat{\gamma}$ is asymptotically exactly the true model $\gamma^0$. This is equivalent to showing that $p(\gamma^0 \mid Z)$ is asymptotically greater than $p(\gamma \mid Z)$ for any $\gamma \neq \gamma^0$, which holds if $p(\gamma^0 \mid Z)$ converges to one in a certain mode.
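For intuition only, the maximization in (2.8) can be written as an exhaustive search over the target model space when $p$ and $t_n$ are tiny. This is not the paper's algorithm, which relies on an MCMC search; the names below and the toy score function are hypothetical stand-ins for a log posterior.

```python
from itertools import combinations

def target_model_space(p, t_n):
    """Yield every gamma (as a tuple of 0/1) with at most t_n ones."""
    for size in range(t_n + 1):
        for active in combinations(range(p), size):
            gamma = [0] * p
            for j in active:
                gamma[j] = 1
            yield tuple(gamma)

def select_model(p, t_n, log_post):
    """Return the gamma in the target model space maximizing log_post(gamma)."""
    return max(target_model_space(p, t_n), key=log_post)

# Toy score (an assumption): rewards including variables 0 and 3, prefers size 2.
toy_score = lambda g: g[0] + g[3] - abs(sum(g) - 2)
print(select_model(p=4, t_n=2, log_post=toy_score))
```

The number of candidate models is $\sum_{k \le t_n} \binom{p}{k}$, so enumeration quickly becomes infeasible as $p$ grows, which motivates a stochastic search.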

3 Main results 

In this section, we present our main results on posterior consistency. Throughout we suppose $\gamma^0 \neq \emptyset$, that is, the true model is not empty. Our first result shows that when properly choosing $t_n \ge s_n$, under certain mild conditions, $p(\gamma^0 \mid Z)$ converges in probability to one, where convergence holds uniformly for the $c_j$s lying within certain ranges. Since typically $s_n$ is unknown, one may face a risk of misspecifying $t_n$ so that $t_n$ is actually smaller than $s_n$. Theoretical results are thus needed to examine this situation. Our second result shows that when $0 < t_n < s_n$, with probability approaching one, the selected $\hat{\gamma}$ is nonnull and is nested in the true model, implying that all the selected variables are significant although there are other significant variables not selected.

Throughout this section, we define $P_\gamma = X_\gamma (X_\gamma^T X_\gamma)^{-1} X_\gamma^T$, i.e., the projection matrix based on $X_\gamma$. We adopt the convention that $P_\emptyset = 0$. Let $\lambda_-(A)$ and $\lambda_+(A)$ be the minimal and maximal eigenvalues of the square matrix $A$. Suppose there exist positive sequences $\underline{\phi}_n$ and $\bar{\phi}_n$ such that $\underline{\phi}_n \le c_j \le \bar{\phi}_n$ for $j = 1, \ldots, p$. Denote $k_n = \|\beta^0\|^2$ and $\psi_n = \min_{j \in \gamma^0} |\beta_j^0|$, where $\beta_j^0$ denotes the $j$th element of $\beta^0$ and $\|\cdot\|$ denotes the $\ell_2$-norm.

3.1 When $t_n \ge s_n$

We first consider the case $t_n \ge s_n$, that is, the size-control parameter $t_n$ is correctly specified as being greater than or equal to the size of the true model. In this case, the true model $\gamma^0$ has positive posterior probability, and thus is in our target model space.

To prove that $p(\gamma^0 \mid Z)$ asymptotically approaches one, we introduce some useful notation and technical assumptions. Define $S_1(t_n) = \{\gamma : \gamma^0 \subset \gamma, \gamma \neq \gamma^0, |\gamma| \le t_n\}$ and $S_2(t_n) = \{\gamma : \gamma^0 \text{ is not nested in } \gamma, |\gamma| \le t_n\}$. It is clear that $S_1(t_n)$ and $S_2(t_n)$ are disjoint, and $S(t_n)$ defined by $S(t_n) = S_1(t_n) \cup S_2(t_n) \cup \{\gamma^0\}$ is the class of all models with size not exceeding $t_n$. To ensure a flexible choice of $t_n$, we assume $t_n \in [s_n, r_n]$ for some integer $r_n \ge s_n$. Our result in this section shows that when properly fixing the upper bound $r_n$, any choice of $t_n \in [s_n, r_n]$ will guarantee that the true model is selected. This says that the selection result is somewhat insensitive to the choice of $t_n$ within a certain range.

Assumption A.1 There exists a positive constant $c_0$ such that, as $n \to \infty$, with probability approaching one, for any $t_n \in [s_n, r_n]$,

1/C ° - r™, A ~ [n X r°\r (In ~ ~ r S) A+ [nK^Xrj * c °> 

and 

min aA-X\ (I n -P y0 )X\ |>l/c . 
Assumption A.2 $\sup_{t_n \in [s_n, r_n]} \max_{\gamma \in S(t_n)} \dfrac{p(\gamma)}{p(\gamma^0)} < \infty$.

Assumption A.3 The sequences $s_n$, $r_n$, $\bar{\phi}_n$, $\underline{\phi}_n$, $k_n$, and $\psi_n$ satisfy, as $n \to \infty$,

(i). $s_n = o(n)$;

(ii). $n\psi_n^2 \to \infty$;

(iii). $s_n \le r_n < n/2$ and $r_n \log p = o(n \log(1 + \min\{1, \psi_n^2\}))$;

(iv). $s_n \log(1 + c_0 n \bar{\phi}_n) = o(n \log(1 + \min\{1, \psi_n^2\}))$;

(v). $\log p = o(\log \underline{\phi}_n)$ and $k_n = O(\underline{\phi}_n)$.

Remark 3.1 We briefly discuss the validity of Assumptions A.1 to A.3. We first have the following result showing that Assumption A.1 holds under a very broad range of situations. Its proof is similar to that of Proposition 2.1 in [39], and thus is omitted.

Proposition 3.1 Assumption A.1 is satisfied if there exists $c_0 > 0$ such that

$$c_0^{-1} \le \min_{|\gamma| \le 2r_n} \lambda_-\left(\frac{1}{n} X_\gamma^T X_\gamma\right) \le \max_{|\gamma| \le 2r_n} \lambda_+\left(\frac{1}{n} X_\gamma^T X_\gamma\right) \le c_0. \tag{3.1}$$

(3.1) is called the sparse Riesz condition, a standard condition in the study of high-dimensional problems; see I571 1221 for applications in LASSO. Proposition 3.1 confirms that the sparse Riesz condition is even stronger than our Assumption A.1. Assumption A.2 holds if we place an indifference prior over $\gamma$ with $|\gamma| \le t_n$, which implies $p(\gamma)/p(\gamma^0) = 1$.

To see when Assumption A.3 holds, let us consider a simple scenario. Suppose $\psi_n = n^{-k_1}$, $s_n = n^{k_2}$, $r_n = n^{k_3}$ and $\log p = n^{k_4}$, where $k_4 > 0$ and $k_1, k_2, k_3$ are nonnegative, satisfying $k_2 \le k_3$ and $2k_1 + k_3 + k_4 < 1$. Furthermore, $\log k_n = O(\log n)$, which is a weaker assumption than that of [25]. Then it can be shown directly that $\bar{\phi}_n$ and $\underline{\phi}_n$ with $\log \bar{\phi}_n = o(n^{1 - 2k_1 - k_2})$ and $n^{k_4} = o(\log \underline{\phi}_n)$ satisfy Assumption A.3. In this simple situation, both $\bar{\phi}_n$ and $\underline{\phi}_n$ are growing exponentially with $n$. In other words, they have to be large enough to support the high-dimensional selection. Here we want to emphasize that the upper bound $\bar{\phi}_n$ and the lower bound $\underline{\phi}_n$ are both necessary for selecting the true model; see [44] for heuristic explanations in a lower-dimensional situation.

Theorem 3.2 Under Assumptions A.1 through A.3, as $n \to \infty$,

$$\min_{s_n \le t_n \le r_n}\ \inf_{\underline{\phi}_n \le c_1, \ldots, c_p \le \bar{\phi}_n} p(\gamma^0 \mid Z) \to 1, \quad \text{in probability}.$$

The proof of Theorem 3.2 is given in the appendix. Theorem 3.2 provides a set of sufficient conditions under which, uniformly for $c_j \in [\underline{\phi}_n, \bar{\phi}_n]$ and $t_n \in [s_n, r_n]$, posterior consistency holds. In other words, selection accuracy is not sensitive to the values of these hyperparameters when they are in a proper range. These conditions are satisfied when $p = O(\exp(n^{k_4}))$ for some $k_4 \in (0, 1)$ (see Remark 3.1); thus, Theorem 3.2 holds in ultrahigh-dimensional settings. The proof of Theorem 3.2 relies on finding sharp upper bounds of the Bayes factors between models, which include $t_n$ as a component. It is shown that uniformly for $t_n \in [s_n, r_n]$ with $s_n$ and $r_n$ growing at certain rates, all these upper bounds can be well managed so that the posterior probability of the true model converges to one. In the next section, we further examine the performance of our Bayesian selection method when $t_n$ is misspecified, i.e., $t_n < s_n$.

In computations (Section 4), to enhance flexibility, we further assume a prior $p(t_n)$ over $t_n$. Concretely, in the simulation study (Section 5) we chose the improper prior $p(t_n) = I(t_n \le m_n)$ with some given $m_n > 0$. Here $m_n$ represents our prior belief on the range of $s_n$, the number of true nonzeros. To be conservative, we set $m_n = n/2$, a commonly accepted upper bound in sparse high-dimensional problems (see [7]), but still find satisfactory selection accuracy.

Here we want to compare Theorem 3.2 with the literature. There are two major types of Bayesian model selection procedures explored in the literature, i.e., the Bayes factor and the fully Bayesian approach based on hierarchical models like (2.1)-(2.5). The Bayes factor is a useful tool for pairwise model comparison and is equivalent to fully Bayesian model selection when $p$ is fixed (see |H], [29]). When $p < n$ is increasing with $n$, these two types of selection methods are not equivalent (see [39]). In this case, [2], [35], [22] proved consistency for Bayes factors, which holds even for $p = O(n)$.

In contrast, the fully Bayesian approach evaluates all the $2^p$ models and selects the model with the highest posterior probability, and thus is essentially different from the Bayes factor in the setting of growing $p$. Important literature includes [17], [34], [29], [9], who showed selection consistency for fixed $p$. Later on these results were generalized to increasing $p$ with $p < n$ in a range of hierarchical models; see [39], [26]. To the best of our knowledge, Theorem 3.2 is the first result establishing posterior consistency for a fully Bayesian method in ultrahigh-dimensional settings. [38] also describes a two-step procedure so that selection consistency holds for $p \gg n$. Of course this procedure is not fully Bayesian, since a preliminary step such as SIS is performed before formal selection. Instead, the selection method introduced in this paper is performed by directly fitting the hierarchical model (2.1)-(2.5). No additional steps such as SIS or the posterior mean thresholding considered by [3] are needed. The key is the application of the prior (2.5). We believe that when adopting this prior, other existing results valid for $p < n$ can also be extended to $p \gg n$.

3.2 When $0 < t_n < s_n$

Now we turn to the case of misspecifying the hyperparameter $t_n$ so that actually $0 < t_n < s_n$. In this case, the true model $\gamma^0$, which has posterior probability zero, is outside our target model space and thus cannot be selected. We will show that even in this false setting the selected model $\hat{\gamma}$ is asymptotically nested in the true model. In other words, all the selected variables are significant ones which ought to be included in the model.

Define $T_0(t_n) = \{\gamma : 0 \le |\gamma| \le t_n, \gamma \subset \gamma^0\}$, $T_1(t_n) = \{\gamma : 0 < |\gamma| \le t_n, \gamma \cap \gamma^0 \neq \emptyset, \gamma \text{ is not nested in } \gamma^0\}$, and $T_2(t_n) = \{\gamma : 0 < |\gamma| \le t_n, \gamma \cap \gamma^0 = \emptyset\}$. It is easy to see that $T_0(t_n)$, $T_1(t_n)$, $T_2(t_n)$ are disjoint and $T(t_n) = T_0(t_n) \cup T_1(t_n) \cup T_2(t_n)$ is exactly the class of $\gamma$ with $|\gamma| \le t_n$. Throughout this section, we make the following assumptions.
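The partition of the target model space into $T_0(t_n)$, $T_1(t_n)$ and $T_2(t_n)$ can be made concrete with a small classifier. This is a sketch under the reading that the null model belongs to $T_0(t_n)$, as the later remark on Theorem 3.4 indicates; the function name and the example vectors are hypothetical.

```python
def classify_model(gamma, gamma0, t_n):
    """Return 'T0', 'T1' or 'T2' for a model gamma with |gamma| <= t_n."""
    active = {j for j, g in enumerate(gamma) if g == 1}
    true_active = {j for j, g in enumerate(gamma0) if g == 1}
    if len(active) > t_n:
        raise ValueError("model is outside the target model space")
    if active <= true_active:
        return "T0"   # nested in the true model (includes the null model)
    if active & true_active:
        return "T1"   # overlaps the true model but is not nested in it
    return "T2"       # nonnull and disjoint from the true model

gamma0 = [1, 1, 1, 0, 0]          # true model of size s_n = 3
for g in ([1, 0, 0, 0, 0], [1, 0, 0, 1, 0], [0, 0, 0, 1, 1]):
    print(g, classify_model(g, gamma0, t_n=2))
```

With $t_n = 2 < s_n = 3$ the true model itself is excluded, and the results of this section concern how posterior mass is distributed among these three classes.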

Assumption B.1 There exist a positive constant $d_0$ and a positive sequence $\rho_n$ such that, when $n \to \infty$, with probability approaching one,




(3.2) 




(3.3) 



0<t n <s, 



Assumption B.2 sup max ^* < oo. 

0<t n <s„ 



Assumption B.3 The sequences $s_n$, $\underline{\phi}_n$, $k_n$, $\psi_n$ and $\rho_n$ satisfy, as $n \to \infty$,

(i). $s_n = o(n)$;

(ii). $n\psi_n^2 \to \infty$;

(iii). $s_n = o(n\psi_n^2)$;

(iv). $k_n = O(\underline{\phi}_n)$;

(v). $\max\{\rho_n, s_n^2 \log p\} = o(\min\{n, \log(\underline{\phi}_n)\})$.
Remark 3.2 Before stating our main theorems in this section, let us examine the validity of Assumptions B.1-B.3. Assumption B.2 holds if we adopt the indifference prior, i.e., $p(\gamma)$ is a positive constant for all $\gamma \in T(t_n)$. The following result demonstrates the validity of Assumption B.1 in a special situation, though we believe this condition may still hold in more general cases.

Proposition 3.3 Suppose the rows of $X$ are iid copies of $(\xi_1, \ldots, \xi_p)$, which is a zero-mean Gaussian vector with $E\{\xi_j^2\} = 1$ for $1 \le j \le p$. The vector $(\xi_1, \ldots, \xi_p)$ is a subvector of the infinite population sequence $\{\xi_j, j = 1, 2, \ldots\}$ which satisfies the Riesz condition, i.e., (4.5) in [51]. Furthermore, $s_n \log p = o(n)$ and the $\xi_j$'s are independent. Then Assumption B.1 holds for $\rho_n = a s_n \log p$ with any constant $a > 4$.

Proposition 3.3 is proved in the Appendix. In the setting of Proposition 3.3 we may choose $\rho_n \asymp s_n \log p$. Suppose $\log k_n = O(\log n)$ and choose $\underline{\phi}_n$ such that $\log \underline{\phi}_n > n$. Let $\psi_n = n^{-k_1}$, $s_n = n^{k_2}$ and $\log p = n^{k_4}$, where $k_4 > 0$ and $k_1, k_2$ are nonnegative, satisfying $2k_1 + k_2 + k_4 < 1$ and $2k_2 + k_4 < 1$. Then it can be easily verified that Assumption B.3 holds in this particular situation. Clearly, Assumptions B.3 and A.3 are not contradictory, in that there exist $\{p, s_n, \psi_n, \underline{\phi}_n, \bar{\phi}_n, k_n\}$ satisfying both conditions. The difference is that Assumption A.3 also involves $r_n$, i.e., the upper bound for the hyperparameter $t_n$, while Assumption B.3 does not, since $t_n$ has already been assumed to be bounded by $s_n$. The careful reader may also notice that, unlike Assumption A.3, which places both an upper bound $\bar{\phi}_n$ and a lower bound $\underline{\phi}_n$, in Assumption B.3 only the lower bound $\underline{\phi}_n$ is assumed. The reason is that, in the subsequent Theorem 3.4, we allow $\hat{\gamma} = \emptyset$, a model in $T_0(t_n)$. This case is preferred when all the $c_j$s tend to infinity (corresponding to $\bar{\phi}_n = \infty$); see [44]. Thus, the upper bound $\bar{\phi}_n$ is not necessary. In Theorem 3.5 below, however, where we show in a certain situation that $\hat{\gamma}$ is nested in the true model but $\hat{\gamma} \neq \emptyset$, an upper bound $\bar{\phi}_n$ will still be needed.

Next we state our first theorem in this section. 

Theorem 3.4 Under Assumptions B.1-B.3, as $n \to \infty$,

$$\max_{0 < t_n < s_n}\ \sup_{\underline{\phi}_n \le c_1, \ldots, c_p \le \bar{\phi}_n}\ \frac{\max_{\gamma \in T_1(t_n) \cup T_2(t_n)} p(\gamma \mid Z)}{\max_{\gamma \in T_0(t_n)} p(\gamma \mid Z)} \to 0, \quad \text{in probability}.$$

The proof of Theorem 3.4 can be found in the Appendix. Theorem 3.4 examines the situation of misspecifying the hyperparameter $t_n$, i.e., $t_n < s_n$. It says that in such a situation, even though the selected model $\hat{\gamma}$ cannot be the true model since necessarily $|\hat{\gamma}| < s_n$, $\hat{\gamma}$ can still be nested in the true model with probability approaching one. Furthermore, convergence holds uniformly for $0 < t_n < s_n$ and the $c_j$s within a certain range.

Theorem 3.4 allows $\hat{\gamma} = \emptyset$. However, when the true model is nonnull, we may ask further if $\hat{\gamma}$ can be nonnull. The following result provides a positive answer to this question. The price we pay is an additional assumption to separate a nonnull model from the null model.
Theorem 3.5 Suppose we happen to choose some $t_n$ from $(0, s_n)$. Let Assumptions B.1-B.3 be satisfied. Suppose, in addition, that Assumption A.3 (iv) holds, and that there is $\gamma \in T_0(t_n) \setminus \{\emptyset\}$ such that $\|\beta^0_{\gamma^0 \setminus \gamma}\|^2 \le f_0 \|\beta^0_\gamma\|^2$, where $f_0 > 0$ is a constant. Then as $n \to \infty$,

$$\sup_{\underline{\phi}_n \le c_1, \ldots, c_p \le \bar{\phi}_n} \frac{p(\emptyset \mid Z)}{p(\gamma \mid Z)} = o_P(1).$$

In other words, $\gamma$ is a better choice than the null model.



The proof of Theorem 3.5 is given in the appendix. In Theorem 3.5 we make the assumption $\|\beta^0_{\gamma^0 \setminus \gamma}\|^2 \le f_0 \|\beta^0_\gamma\|^2$. Heuristically, $\|\beta^0_\gamma\|$ represents the information of the model $\gamma$ and $\|\beta^0_{\gamma^0 \setminus \gamma}\|$ represents the information of the complement model $\gamma^0 \setminus \gamma$. This assumption simply says that much of the information of the true model is concentrated on $\gamma$. Theorem 3.5 states that with this "information" assumption and Assumption A.3 (iv), model $\gamma$ can successfully outperform the null model, so that $\hat{\gamma} \neq \emptyset$ with arbitrarily large probability. Note that Assumption A.3 (iv) is necessary here, since otherwise, with the $c_j$s approaching infinity, the null model will always be preferred (see iPRl). To the best of our knowledge, Theorems 3.4 and 3.5 are the first theoretical results in the fully Bayesian setting examining the selection performance with misspecified hyperparameters.



3.3 Extensions to the g-prior setting 

In this section, we extend the results in Sections 3.1 and 3.2 to the g-prior setting. For simplicity, let all the variance-control parameters be the same, i.e., $c_j = c$ for all $j = 1, \ldots, p$. Instead of using a fixed $c$, we place over $c$ a proper prior $g(c)$, i.e., $\int_0^\infty g(c)\,dc = 1$. Here we consider a broad functional class for $g(c)$, including variations of the Zellner-Siow prior proposed by [50] and the hyper-g prior proposed by [29].

Assuming a random $c \in (0, \infty)$, the conditional probability of $\gamma$ given $(c, Z)$ is exactly

$$p(\gamma \mid c, Z) \propto \det(W_\gamma)^{-1/2}\, p(\gamma) \left(1 + Y^T (I_n - X_\gamma U_\gamma^{-1} X_\gamma^T) Y\right)^{-(n+\nu)/2}, \tag{3.4}$$

where $W_\gamma$ and $U_\gamma$, both depending on $c$, are defined as in (2.7). Consequently, the posterior probability of $\gamma$, in the setting of the g-prior, is given by

$$p_g(\gamma \mid Z) = \int_0^\infty p(\gamma \mid c, Z)\, g(c)\,dc, \tag{3.5}$$

where the subscript $g$ indicates the posterior probability in the setting of the g-prior.

We will prove that $p_g(\gamma \mid Z)$ shares probabilistic properties similar to those in Sections 3.1 and 3.2, even though a g-prior setting is considered. [29] obtained selection consistency in the g-prior settings where $p$ is fixed. Their proof relies on an application of the Laplace approximation to the posterior likelihood. Here we will use a different approach, which relies on the uniform convergence results that have been derived in previous sections. Our first theorem below treats the case when $t_n \in [s_n, r_n]$ with $r_n \ge s_n$ being some integer.

Theorem 3.6 Suppose Assumptions A.1-A.3 are satisfied. Furthermore, $g$ is proper and satisfies, as $n \to \infty$, $\int_0^{\underline{\phi}_n} g(c)\,dc = o(1)$ and $\int_{\bar{\phi}_n}^\infty g(c)\,dc = o(1)$. Then as $n \to \infty$, $\min_{s_n \le t_n \le r_n} p_g(\gamma^0 \mid Z) \to 1$, in probability.

Theorem 3.6 is proved in the appendix. It establishes model selection consistency under the g-prior setting. Again, this result holds uniformly for $t_n \in [s_n, r_n]$.

Our second and third results treat the case $0 < t_n < s_n$. The proofs are given in the appendix. They state that even when one misspecifies $t_n$ such that it actually lies in $(0, s_n)$, the selected model may still be nested in the true model, and even nonnull. However, we are only able to show the desired results for those $g$s with compact support $[\underline{\phi}_n, \bar{\phi}_n]$, though we conjecture that these results may still hold for more general $g$s.



Theorem 3.7 Suppose Assumptions B.1-B.3 are satisfied. Furthermore, $g$ is proper and supported in $[\underline{\phi}_n, \bar{\phi}_n]$, i.e., $g(c) = 0$ if $c \notin [\underline{\phi}_n, \bar{\phi}_n]$. Then as $n \to \infty$,

$$\max_{0 < t_n < s_n} \frac{\max_{\gamma \in T_1(t_n) \cup T_2(t_n)} p_g(\gamma \mid Z)}{\max_{\gamma \in T_0(t_n)} p_g(\gamma \mid Z)} \to 0, \quad \text{in probability}.$$

Theorem 3.8 Suppose we happen to choose some $t_n$ from $(0, s_n)$. Let Assumptions B.1-B.3 be satisfied. Suppose, in addition, that Assumption A.3 (iv) holds, that there is $\gamma \in T_0(t_n) \setminus \{\emptyset\}$ such that $\|\beta^0_{\gamma^0 \setminus \gamma}\|^2 \le f_0 \|\beta^0_\gamma\|^2$, where $f_0 > 0$ is a constant, and that $g$ is proper and supported in $[\underline{\phi}_n, \bar{\phi}_n]$. Then as $n \to \infty$, $p_g(\emptyset \mid Z)/p_g(\gamma \mid Z) = o_P(1)$. In other words, $\gamma$ is a better choice than the null model in the setting of the g-prior.



3.4 Generalized Zellner-Siow prior and generalized hyper-g prior 

In this section, motivated by [50] and [29] in the fixed-$p$ scenario, two new types of g-priors are proposed. The first one is a generalization of the Zellner-Siow prior motivated by [50]. The second one is a generalization of the hyper-g prior motivated by [29]. Neither variation has appeared in the literature, and both are nontrivial.

The original form of the Zellner-Siow prior is $g(c) \propto c^{-3/2} \exp(-n/(2c))$; see [29]. However, as demonstrated in our simulation study, the accuracy of using this prior severely decreases in the high-dimensional setting. The reason is, as revealed in the discussions in Remark 3.1, that to achieve more accurate selection, one has to shift the range $[\underline{\phi}_n, \bar{\phi}_n]$ to be suitably large. A possible choice is to make both $\underline{\phi}_n$ and $\bar{\phi}_n$ exponentially growing with $n$. To achieve selection consistency in the g-prior setting, one may choose $g$ concentrated on $[\underline{\phi}_n, \bar{\phi}_n]$ (see Theorem 3.6), implying that the mode of $g$ is, say, exponentially growing with $n$; compare the original form of the Zellner-Siow prior, whose mode is $n/3$. This motivates us to consider the following generalized Zellner-Siow prior

$$g(c) = \frac{p^{a b_n}}{\Gamma(a)}\, c^{-a-1} \exp(-p^{b_n}/c), \tag{3.6}$$

where $a > 0$, $b_n > 0$ are fixed hyperparameters. The prior in (3.6) is actually IG$(a, p^{b_n})$ with mode $p^{b_n}/(a+1)$.
A nice property of this prior is its conjugacy, which allows us to use a Gibbs sampler step to draw the $c$
samples. A proper choice is a constant $a > 0$ and $b_n \asymp \log n$. With direct calculations we have
$\int_0^{\underline{\phi}_n} g(c)\,dc = (\Gamma(a))^{-1}\int_{p^{b_n}/\underline{\phi}_n}^{\infty} c^{a-1}\exp(-c)\,dc$ and
$\int_{\bar{\phi}_n}^{\infty} g(c)\,dc = (\Gamma(a))^{-1}\int_0^{p^{b_n}/\bar{\phi}_n} c^{a-1}\exp(-c)\,dc$.
Thus, with $\underline{\phi}_n$ and $\bar{\phi}_n$ chosen so that $p^{b_n}/\underline{\phi}_n \to \infty$ and $p^{b_n}/\bar{\phi}_n \to 0$, both integrals are $o(1)$, i.e.,
$g$ satisfies the condition in Theorem 3.6. Note that this condition is violated for $a = 0$. Furthermore, it follows
from the discussions in Remark 3.1 that such a choice of $\underline{\phi}_n$ and $\bar{\phi}_n$ also fulfills Assumption A.3 for
$s_n, \psi_n, p$ specified therein. This shows that the prior (3.6) can indeed induce
consistent Bayesian selection.
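As a rough illustration of the prior in (3.6), the following Python sketch evaluates its log density, its mode, and a random draw. It assumes the IG$(a, p^{b_n})$ reading stated in the text; the function names and the use of the standard library `random` module are illustrative choices, not part of the paper's (Matlab) implementation.

```python
import math
import random

def gzs_log_density(c, a, p, b_n):
    """Log density of the generalized Zellner-Siow prior, read as IG(a, p**b_n)."""
    scale = p ** b_n
    return a * math.log(scale) - math.lgamma(a) - (a + 1) * math.log(c) - scale / c

def gzs_mode(a, p, b_n):
    """Mode of IG(a, p**b_n), i.e. p**b_n / (a + 1)."""
    return p ** b_n / (a + 1)

def gzs_sample(a, p, b_n, rng=random):
    """Draw c ~ IG(a, p**b_n) as scale / G with G ~ Gamma(a, 1); requires a > 0."""
    return p ** b_n / rng.gammavariate(a, 1.0)
```

With $p = 1000$ and $b_n = 3$ the mode is of order $10^9$, which illustrates how far the prior mass is shifted compared with the original Zellner-Siow mode $n/3$.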



Next we explore our second type of g-prior. Following [29], the motivation of the hyper-g prior is
that the shrinkage factor $c/(1+c)$ has most of its mass near 1, for which they assume $c/(1+c)$ to have a beta
distribution with hyperparameters properly managed. However, as demonstrated in our simulation study, the
hyper-g prior and the hyper-g/n prior considered in [29], though they work well in lower-dimensional situations,
do not work well in the high-dimensional setting. The reason is similar to that for the conventional Zellner-Siow
prior, i.e., the modes of these g-priors are not large enough to support high-dimensional selection. From
this point of view, we consider $c/(1+c) \sim \mathrm{Beta}(a_n, b)$, leading to the following generalized hyper-g prior

$$g(c) = \frac{\Gamma(a_n+b)}{\Gamma(a_n)\Gamma(b)} \cdot \frac{c^{a_n-1}}{(1+c)^{a_n+b}}, \tag{3.7}$$

where $b > 0$ is constant and $a_n = p^{\alpha_n} + 1$ with $\alpha_n \asymp \log n$. The mode of our generalized hyper-g
prior is $(a_n - 1)/(b+1)$. With $\underline{\phi}_n$ and $\bar{\phi}_n$ chosen as suitable powers of $p$, it can be verified by direct calculations that
$\int_0^{\underline{\phi}_n} g(c)\,dc = O(a_n^{b-1}\exp(-a_n/(1+\underline{\phi}_n))) = o(1)$ and $\int_{\bar{\phi}_n}^{\infty} g(c)\,dc = O(a_n^{b}/(1+\bar{\phi}_n)) = o(1)$. Therefore, the
proposed generalized hyper-g prior also satisfies the assumptions in Theorem 3.6, implying selection
consistency.
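A small Python sketch of the generalized hyper-g prior may help to check the mode formula numerically. It assumes the density in (3.7) with $c/(1+c) \sim \mathrm{Beta}(a_n, b)$ and $b > 0$; the helper names are hypothetical. Sampling uses the standard fact that a ratio of independent Gamma$(a_n,1)$ and Gamma$(b,1)$ variables has this distribution, which avoids forming $1 - c/(1+c)$ when $a_n$ is very large.

```python
import math
import random

def ghg_log_density(c, a_n, b):
    """Log density of (3.7): c / (1 + c) ~ Beta(a_n, b), with b > 0."""
    log_norm = math.lgamma(a_n + b) - math.lgamma(a_n) - math.lgamma(b)
    return log_norm + (a_n - 1) * math.log(c) - (a_n + b) * math.log1p(c)

def ghg_mode(a_n, b):
    """Mode (a_n - 1) / (b + 1) of the generalized hyper-g prior."""
    return (a_n - 1) / (b + 1)

def ghg_sample(a_n, b, rng=random):
    """Draw c = G1 / G2 with G1 ~ Gamma(a_n, 1) and G2 ~ Gamma(b, 1)."""
    return rng.gammavariate(a_n, 1.0) / rng.gammavariate(b, 1.0)
```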

In implementations we simply choose $a = b = 0$ to achieve the maximum prior modes for both the generalized
Zellner-Siow prior and the generalized hyper-g prior, though this choice may violate the limit conditions in Theorem
3.6. Our empirical results in Section 5 demonstrate satisfactory performance of such a choice.



3.5 Simultaneous credible intervals 

In many applications, model selection is just an initial step. After selecting the model, it is important to 
further make inference on the selected variables, e.g., constructing simultaneous credible intervals for the 
nonzero features. 

Suppose one has selected model $\gamma$, and the goal is to further build credible intervals for the $\beta_j$'s with $\gamma_j = 1$.
To ease technical arguments, we assume known $\sigma^2$ and $c_j$'s, $\gamma_j = 1$ for $j = 1,\dots,r$, and $\gamma_j = 0$ for
$j = r+1,\dots,p$. Therefore the hierarchical model becomes

$$Y|\beta \sim N(X\beta, \sigma^2 I_n), \quad \beta_j \sim (1-\gamma_j)\delta_0 + \gamma_j N(0, c_j\sigma^2).$$

With straightforward calculations one can show that the posterior of $\beta_\gamma$ is $N(\xi, \sigma^2 U_\gamma^{-1})$, where $\xi = U_\gamma^{-1}X_\gamma^T Y$ and $U_\gamma$ is
defined as in (2.7). Thus, the marginal posterior distribution for $\beta_j$, $j = 1,\dots,r$, is $\beta_j \sim N(\xi_j, \sigma_j^2)$, where
$\xi_j$ is the $j$th component of $\xi$ and $\sigma_j^2$ is the $j$th diagonal element of $\sigma^2 U_\gamma^{-1}$. The $100 \times (1-\alpha)\%$ credible
interval for $\beta_j$ is thus

$$\mathrm{CI}_j:\ \xi_j \pm c_{\alpha/2}\,\sigma_j, \quad j = 1,\dots,r, \tag{3.8}$$
where $c_{\alpha/2}$ is the lower $(\alpha/2)$-th quantile of the standard normal distribution.

To see the performance of the intervals $\mathrm{CI}_j$, we use the concept of Bayes false coverage rate (FCR) considered
by [53]. Namely, let $V$ be the number of the $\mathrm{CI}_j$'s which do not cover $\beta_j$. Then FCR $= E\{V/r\}$. Since
for any $j = 1,\dots,r$, $P(\beta_j \notin \mathrm{CI}_j|Z) = \alpha$, it follows by Theorem 2 of [53] that FCR $\le \alpha$. In other words, the
Bayes FCR of the simultaneous credible intervals constructed in (3.8) can be controlled at an arbitrary nominal
level $\alpha$, though a smaller $\alpha$ would enlarge the $\mathrm{CI}_j$'s simultaneously.
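The interval construction in (3.8) can be sketched in a few lines of Python. This is a minimal sketch that assumes $U_\gamma = X_\gamma^T X_\gamma + \mathrm{diag}(1/c_j)$, which is a guess at (2.7) since that display is not reproduced in this section; the function names and the small pure-Python matrix inverse are illustrative only.

```python
from statistics import NormalDist

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == k) for k in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * u for v, u in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def credible_intervals(X_sel, Y, c, sigma2, alpha=0.05):
    """Intervals xi_j +/- z * sigma_j for the r selected columns of X.

    Assumption: U = X_sel^T X_sel + diag(1 / c_j) plays the role of U_gamma.
    """
    n, r = len(X_sel), len(X_sel[0])
    U = [[sum(X_sel[i][j] * X_sel[i][k] for i in range(n)) + (1.0 / c[j] if j == k else 0.0)
          for k in range(r)] for j in range(r)]
    U_inv = invert(U)
    XtY = [sum(X_sel[i][j] * Y[i] for i in range(n)) for j in range(r)]
    xi = [sum(U_inv[j][k] * XtY[k] for k in range(r)) for j in range(r)]
    z = NormalDist().inv_cdf(1 - alpha / 2)  # magnitude of the alpha/2 normal quantile
    half = [z * (sigma2 * U_inv[j][j]) ** 0.5 for j in range(r)]
    return [(xi[j] - half[j], xi[j] + half[j]) for j in range(r)]
```

In the simulation study the unknown $\sigma^2$ and $c$ are replaced by posterior averages before applying this formula.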



4 Computational details 

In this section we present the sampling details. In Section 4.1 we fix the $c_j$'s and demonstrate how to use
MCMC to draw samples of $\beta, \gamma, \sigma^2, t_n$. In Section 4.2 we discuss various ways of handling the $c_j$'s,
including using BIC or RIC, in which the $c_j$'s are fixed a priori, or using an additional MCMC step to draw
samples of the $c_j$'s in a g-prior setting.

4.1 A constrained blockwise Gibbs sampler for automatic and stochastic model search 

In previous sections $t_n$, i.e., the size-control parameter in (2.5), is a fixed integer. Though the theory holds
uniformly for $t_n$ within a certain range, in practice one still has to choose a proper value to facilitate
computation. To address this difficulty, we further place a prior on $t_n$. Specifically, for simplicity, we let
$p(t_n) \propto I(t_n \le m_n)$, i.e., a uniform prior on $[1, m_n]$ with $m_n$ being a predetermined integer. Other
choices with more complicated forms can also be used, with corresponding revisions to the following
algorithm. With this prior and based on the Bayesian hierarchical model (2.1)-(2.5), the posterior
distribution is

$$p(\beta,\gamma,\sigma^2,t_n|Z) \propto p(\beta,\gamma,\sigma^2|t_n,Z)\,p(t_n), \tag{4.1}$$
where $p(\beta,\gamma,\sigma^2|t_n,Z)$ is exactly given in (2.6). Temporarily all the $c_j$'s are fixed hyperparameters.

We will present an efficient Gibbs sampler to draw posterior samples from (4.1). In a conventional Gibbs sampler
one draws samples iteratively and separately from the full conditionals $p(\beta|\gamma,\sigma^2,t_n,Z)$, $p(\gamma|\beta,\sigma^2,t_n,Z)$,
$p(\sigma^2|\beta,\gamma,t_n,Z)$ and $p(t_n|\beta,\gamma,\sigma^2,Z)$. However, for our specific Bayesian model, it can be shown that
both the full conditionals for $\beta$ and $\gamma$ involve intensive matrix inversion, an extremely time-consuming
step when the data dimension is large or a long Markov chain needs to be sampled. To ease the
matrix inversion computation, [27] used a novel technique in structured high-dimensional models which reduces
computing time. Here we adopt a different approach that fully avoids the computation of
inverse matrices.



To improve sampling speed, we propose a constrained blockwise Gibbs sampler motivated by [23] and
[48]. The basic idea of the original blockwise Gibbs sampler is to treat each two-dimensional vector $g_j =
(\beta_j, \gamma_j)$, for $j = 1,\dots,p$, as a block. Instead of sampling $\beta$ and $\gamma$ separately, we draw them together
by sampling the blocks $g_j$ iteratively. A nice property of the blockwise Gibbs sampler is that it
avoids matrix inversion, and therefore is more computationally efficient. However,
our specific prior on the model space, i.e., the inclusion of the hyperparameter $t_n$ that controls the model
size, induces nontrivial modifications in this method. Specifically, during the sampling process, to apply the
blockwise technique, the size of the sampled model from the previous iteration must not exceed $t_n$, which
makes this essentially a constrained version of the blockwise procedure. In practical implementations, we further
allow a stochastic draw of $t_n$, i.e., an automatic and stochastic control of the model sizes during posterior
sampling, which makes our procedure even more flexible.



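From (2.6), the joint posterior of $g_1,\dots,g_p$ given $\sigma^2$ and $t_n$ is

$$p(g_1,\dots,g_p|\sigma^2,t_n,Z) \propto (2\pi\sigma^2)^{-|\gamma|/2}\det(\Sigma_\gamma)^{-1/2}\,p(\gamma)\exp\left(-\frac{\|Y-X\beta\|^2+\beta_\gamma^T\Sigma_\gamma^{-1}\beta_\gamma}{2\sigma^2}\right)\prod_{k:\gamma_k=0}\delta_0(\beta_k). \tag{4.2}$$

Denote $g_{-j} = \{g_1,\dots,g_{j-1},g_{j+1},\dots,g_p\}$. If $|\gamma_{-j}| > t_n$, i.e., the number of indexes $k$ with $k \ne j$ and $\gamma_k = 1$
is greater than $t_n$, then the posterior probability in (4.2) becomes zero. So we only consider $|\gamma_{-j}| \le t_n$. To
ease the technical arguments, suppose for each $g_k = (\beta_k,\gamma_k)$ with $k \ne j$, $\gamma_k$ and $\beta_k$ "match" each other in the
sense that $\gamma_k = 1$ if $\beta_k \ne 0$, and $\gamma_k = 0$ if $\beta_k = 0$. It can be shown directly from (4.2) that

$$\begin{aligned}
p(g_j|g_{-j},\sigma^2,t_n,Z)
&\propto (2\pi c_j\sigma^2)^{-\gamma_j/2}\,p(\gamma_j,\gamma_{-j})\exp\left(-\frac{\|Y-X\beta\|^2+\beta_j^2c_j^{-1}}{2\sigma^2}\right)\delta_0(\beta_j)^{1-\gamma_j}\\
&\propto (2\pi c_j\sigma^2)^{-\gamma_j/2}\,p(\gamma_j,\gamma_{-j})\exp\left(-\frac{(X_j^TX_j+c_j^{-1})\beta_j^2-2w_j\beta_j}{2\sigma^2}\right)\delta_0(\beta_j)^{1-\gamma_j},
\end{aligned} \tag{4.3}$$

where $w_j = (Y - X_{-j}\beta_{-j})^T X_j$ and $X_{-j} = (X_1,\dots,X_{j-1},X_{j+1},\dots,X_p)$ for $j = 1,\dots,p$.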

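We first consider $|\gamma_{-j}| < t_n$. In this case, $p(\gamma_j,\gamma_{-j})$ is always positive since the size of $(\gamma_j,\gamma_{-j})$, i.e., $\gamma_j+|\gamma_{-j}|$,
does not exceed $t_n$. From (4.3), it can be shown that the full conditionals of $(\gamma_j = 1, \beta_j)$ and $(\gamma_j = 0, \beta_j)$ are
respectively

$$p(\gamma_j = 1,\beta_j|g_{-j},\sigma^2,t_n,Z) \propto (2\pi c_j\sigma^2)^{-1/2}\,p(\gamma_j=1,\gamma_{-j})\exp\left(-\frac{v_j^2\beta_j^2-2w_j\beta_j}{2\sigma^2}\right), \tag{4.4}$$

where $v_j^2 = X_j^TX_j + c_j^{-1}$, and

$$p(\gamma_j = 0,\beta_j|g_{-j},\sigma^2,t_n,Z) \propto p(\gamma_j=0,\gamma_{-j})\exp\left(-\frac{v_j^2\beta_j^2-2w_j\beta_j}{2\sigma^2}\right)\delta_0(\beta_j). \tag{4.5}$$

Integrating out $\beta_j$ in (4.4) and (4.5), one obtains the marginal distribution for $\gamma_j$ given by

$$p(\gamma_j = 1|g_{-j},\sigma^2,t_n,Z) \propto \frac{p(\gamma_j=1,\gamma_{-j})}{\sqrt{c_jv_j^2}}\exp\left(\frac{w_j^2}{2\sigma^2v_j^2}\right) \tag{4.6}$$

and

$$p(\gamma_j = 0|g_{-j},\sigma^2,t_n,Z) \propto p(\gamma_j=0,\gamma_{-j}). \tag{4.7}$$

From (4.6) and (4.7), we can draw $\gamma_j$ marginally through

$$p(\gamma_j = 0|g_{-j},\sigma^2,t_n,Z) = \frac{1}{1+\dfrac{1}{\sqrt{c_jv_j^2}}\cdot\dfrac{p(\gamma_j=1,\gamma_{-j})}{p(\gamma_j=0,\gamma_{-j})}\cdot\exp\left(\dfrac{w_j^2}{2\sigma^2v_j^2}\right)}. \tag{4.8}$$

Then by (4.4) and (4.5), we sample $\beta_j$ through the following conditional distributions

$$\beta_j|\gamma_j=1,g_{-j},\sigma^2,t_n,Z \sim N\left(\frac{w_j}{v_j^2},\frac{\sigma^2}{v_j^2}\right), \tag{4.9}$$

$$p(\beta_j = 0|\gamma_j=0,g_{-j},\sigma^2,t_n,Z) = 1. \tag{4.10}$$

Through (4.8)-(4.10), we can draw a sample $g_j$ from $p(g_j|g_{-j},\sigma^2,t_n,Z)$ in the setting $|\gamma_{-j}| < t_n$, for $j =
1,\dots,p$.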
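The step from (4.4) to (4.6) and (4.9) is a completion of the square. A short worked version is given below, assuming the exponent in (4.4) is $-(v_j^2\beta_j^2 - 2w_j\beta_j)/(2\sigma^2)$ with $v_j^2 = X_j^TX_j + c_j^{-1}$ and $w_j = (Y - X_{-j}\beta_{-j})^TX_j$:

```latex
\begin{align*}
-\frac{v_j^2\beta_j^2 - 2w_j\beta_j}{2\sigma^2}
  &= -\frac{v_j^2}{2\sigma^2}\Big(\beta_j - \frac{w_j}{v_j^2}\Big)^2
     + \frac{w_j^2}{2\sigma^2 v_j^2}, \\
\int_{-\infty}^{\infty} (2\pi c_j\sigma^2)^{-1/2}
  \exp\Big(-\frac{v_j^2\beta_j^2 - 2w_j\beta_j}{2\sigma^2}\Big)\,d\beta_j
  &= (2\pi c_j\sigma^2)^{-1/2}\Big(\frac{2\pi\sigma^2}{v_j^2}\Big)^{1/2}
     \exp\Big(\frac{w_j^2}{2\sigma^2 v_j^2}\Big)
   = \frac{1}{\sqrt{c_j v_j^2}}\exp\Big(\frac{w_j^2}{2\sigma^2 v_j^2}\Big).
\end{align*}
```

The first line shows that, given $\gamma_j = 1$, $\beta_j$ is normal with mean $w_j/v_j^2$ and variance $\sigma^2/v_j^2$; the second line gives the factor multiplying $p(\gamma_j = 1, \gamma_{-j})$ in the marginal for $\gamma_j$. Only scalar quantities appear, which is why no matrix inversion is needed.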

When $|\gamma_{-j}| = t_n$, it follows directly from (4.2) that $p(\gamma_j = 1,\beta_j,g_{-j}|\sigma^2,t_n,Z) = 0$, implying $p(\gamma_j =
0|g_{-j},\sigma^2,t_n,Z) = 1$. Using (4.3), it can be shown that $p(\beta_j = 0|\gamma_j = 0,g_{-j},\sigma^2,t_n,Z) = 1$. In other words,
we simply set $\beta_j = 0$ to match its binary state $\gamma_j$, by which we can keep the model sizes from
exceeding $t_n$. We should mention that this additional "size-control" step does not appear in conventional
lower-dimensional Bayesian model selection; see [23] or [48] for comparison. Here we need it to address
the ultrahigh dimensionality.



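From (2.6), it can be verified that the full conditional of $\sigma^2$ is given by

$$p(\sigma^2|\beta,\gamma,t_n,Z) \propto (\sigma^2)^{-\frac{n+|\gamma|+\nu}{2}-1}\exp\left(-\frac{\|Y-X\beta\|^2+\beta_\gamma^T\Sigma_\gamma^{-1}\beta_\gamma+1}{2\sigma^2}\right), \tag{4.11}$$

that is, $\sigma^2|\beta,\gamma,t_n,Z \sim IG\left(\frac{n+|\gamma|+\nu}{2}, \frac{\|Y-X\beta\|^2+\beta_\gamma^T\Sigma_\gamma^{-1}\beta_\gamma+1}{2}\right)$, where $IG(a,b)$ denotes the inverse-gamma
distribution with density $\pi(x) \propto x^{-a-1}\exp(-b/x)$. Finally, given $\beta,\gamma,\sigma^2$, it is easy to see that $t_n$ is uniform in
$[|\gamma|, m_n]$.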

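To conclude, we summarize our Gibbs sampler in a form that can be applied directly in programming.
Set the initial state $\gamma_j^{(0)} = 0$, $\beta_j^{(0)} = 0$, for $j = 1,\dots,p$, $\sigma^{2(0)}$ to be a randomly selected positive number, and
$t_n^{(0)}$ to be uniform over $[1, m_n]$. Suppose we have sampled $(\gamma^{(l)},\beta^{(l)},\sigma^{2(l)},t_n^{(l)})$ from the $l$th iteration.

(i). Suppose, in the $(l+1)$th iteration, we have sampled the first $j-1$ blocks, i.e., $g_1^{(l+1)} = (\beta_1^{(l+1)},\gamma_1^{(l+1)}),\dots,g_{j-1}^{(l+1)} = (\beta_{j-1}^{(l+1)},\gamma_{j-1}^{(l+1)})$.
Denote $\gamma_{-j} = (\gamma_1^{(l+1)},\dots,\gamma_{j-1}^{(l+1)},\gamma_{j+1}^{(l)},\dots,\gamma_p^{(l)})$ and $\beta_{-j} = (\beta_1^{(l+1)},\dots,\beta_{j-1}^{(l+1)},\beta_{j+1}^{(l)},\dots,\beta_p^{(l)})$.
To generate $g_j^{(l+1)} = (\beta_j^{(l+1)},\gamma_j^{(l+1)})$, we use the following procedure:

(1). If $|\gamma_{-j}| < t_n^{(l)}$, then set $\gamma_j^{(l+1)} = 0$ with probability $1/(1+\theta_j)$, where
$\theta_j = \frac{1}{\sqrt{c_jv_j^2}}\cdot\frac{p(\gamma_j=1,\gamma_{-j})}{p(\gamma_j=0,\gamma_{-j})}\cdot\exp\left(\frac{w_j^2}{2\sigma^2v_j^2}\right)$,
with $\sigma^2 = \sigma^{2(l)}$, $w_j = (Y-X_{-j}\beta_{-j})^TX_j$ and $v_j^2 = X_j^TX_j + c_j^{-1}$.
If $\gamma_j^{(l+1)} = 1$, then draw $\beta_j^{(l+1)}$ from $N(w_j/v_j^2, \sigma^2/v_j^2)$. Else, if $\gamma_j^{(l+1)} = 0$, then set $\beta_j^{(l+1)} = 0$.

(2). If $|\gamma_{-j}| = t_n^{(l)}$, then set $\gamma_j^{(l+1)} = 0$ and $\beta_j^{(l+1)} = 0$.

(ii). After finishing (i) for all $j = 1,\dots,p$, denote $\gamma^{(l+1)}$ and $\beta^{(l+1)}$ to be the current update of $\gamma$ and $\beta$.
Draw $\sigma^{2(l+1)}$ from
$IG\left(\frac{n+|\gamma^{(l+1)}|+\nu}{2}, \frac{\|Y-X\beta^{(l+1)}\|^2+(\beta_\gamma^{(l+1)})^T\Sigma_\gamma^{-1}\beta_\gamma^{(l+1)}+1}{2}\right)$.

(iii). Draw $t_n^{(l+1)}$ uniformly over $[|\gamma^{(l+1)}|, m_n]$.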
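Steps (i)-(iii) can be sketched as one Python function. This is a minimal sketch rather than the authors' Matlab code, under several labeled assumptions: the prior ratio $p(\gamma_j=1,\gamma_{-j})/p(\gamma_j=0,\gamma_{-j})$ from (2.5) is not reproduced in this section, so it is passed in as a hypothetical `log_prior_ratio` callback (the default 0.0 is only a placeholder); the $\sigma^2$ update assumes the inverse-gamma form with degrees of freedom `nu`; and the lower end of the uniform draw for $t_n$ is taken as $\max(1, |\gamma|)$ to stay inside the prior support $[1, m_n]$.

```python
import math
import random

def gibbs_sweep(X, Y, beta, gamma, sigma2, t_n, c, m_n, nu=6.0,
                log_prior_ratio=lambda size_rest: 0.0, rng=random):
    """One iteration of steps (i)-(iii); beta and gamma are updated in place."""
    n, p = len(Y), len(beta)
    # Residual Y - X beta for the current state.
    resid = [Y[i] - sum(X[i][k] * beta[k] for k in range(p) if gamma[k]) for i in range(n)]
    size = sum(gamma)
    for j in range(p):
        xj = [X[i][j] for i in range(n)]
        if gamma[j]:  # take block j out, so that resid = Y - X_{-j} beta_{-j}
            resid = [r + x * beta[j] for r, x in zip(resid, xj)]
            size -= 1
        gamma[j], beta[j] = 0, 0.0
        if size < t_n:  # case (1): block j may enter the model
            w = sum(r * x for r, x in zip(resid, xj))
            v2 = sum(x * x for x in xj) + 1.0 / c[j]
            log_theta = (-0.5 * math.log(c[j] * v2) + log_prior_ratio(size)
                         + w * w / (2.0 * sigma2 * v2))
            # P(gamma_j = 1) = theta / (1 + theta), evaluated without overflow.
            if log_theta > 0:
                prob_one = 1.0 / (1.0 + math.exp(-log_theta))
            else:
                e = math.exp(log_theta)
                prob_one = e / (1.0 + e)
            if rng.random() < prob_one:
                gamma[j] = 1
                beta[j] = rng.gauss(w / v2, math.sqrt(sigma2 / v2))
                resid = [r - x * beta[j] for r, x in zip(resid, xj)]
                size += 1
        # case (2): |gamma_{-j}| = t_n, block j stays at (0, 0)
    rss = sum(r * r for r in resid)
    penalty = sum(beta[j] ** 2 / c[j] for j in range(p) if gamma[j])
    shape = (n + size + nu) / 2.0            # assumed inverse-gamma shape
    scale = (rss + penalty + 1.0) / 2.0      # assumed inverse-gamma scale
    sigma2 = scale / rng.gammavariate(shape, 1.0)
    t_n = rng.randint(max(1, size), m_n)     # uniform draw of the size bound
    return sigma2, t_n
```

Keeping the residual vector up to date inside the sweep means each block update costs $O(n)$ operations and no matrix is ever inverted, which is the computational point made in the text.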



4.2 About the $c_j$'s

The choice of the $c_j$'s plays an important role in the practical implementation of our method, and therefore must
be handled with care. In our numerical study, we chose $c_j = c$, a constant hyperparameter for all $j = 1,\dots,p$,
though they can be chosen as different numbers if we have prior preferences over
certain coefficients.

There are several popular ways of finding $c$, including BIC, RIC (see [13]), and the Benchmark prior method
(see [17]). In these methods $c$ is fixed as $n$, $p^2$, and $\max\{n, p^2\}$ respectively. An alternative way is to
avoid finding $c$ by assuming $c$ to follow a g-prior such as the ones introduced in Section 3.4, though a
Metropolis-Hastings step might be needed to draw the $c$ samples.

Suppose $g(c)$ is a proper prior over $c$; then the full conditional of $c$ can be derived directly as

$$p(c|\beta,\gamma,\sigma^2,t_n,Z) \propto p(\beta|\gamma,c,\sigma^2)\,g(c) \propto c^{-|\gamma|/2}\exp\left(-\frac{\|\beta_\gamma\|^2}{2c\sigma^2}\right)g(c). \tag{4.12}$$

When $g$ is the generalized Zellner-Siow prior specified by (3.6), (4.12) has a closed form. Explicitly, $c$
follows $IG(a + |\gamma|/2,\ p^{b_n} + \|\beta_\gamma\|^2/(2\sigma^2))$.
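The conjugate update of $c$ under the generalized Zellner-Siow prior amounts to a single inverse-gamma draw. The sketch below assumes the closed form $IG(a + |\gamma|/2,\ p^{b_n} + \|\beta_\gamma\|^2/(2\sigma^2))$ stated above; the function name is hypothetical. Note that the shape must be positive, so with $a = 0$ the draw is only defined when the current model is nonnull.

```python
import random

def draw_c_gzs(beta_sel, sigma2, a, p, b_n, rng=random):
    """Draw c given the coefficients of the currently selected variables."""
    shape = a + len(beta_sel) / 2.0
    scale = p ** b_n + sum(b * b for b in beta_sel) / (2.0 * sigma2)
    # If G ~ Gamma(shape, 1), then scale / G ~ IG(shape, scale).
    return scale / rng.gammavariate(shape, 1.0)
```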

When $g$ is the generalized hyper-g prior specified in (3.7), (4.12) does not have a closed form. In this case,
we have to incorporate a Metropolis-Hastings step. Technically, we reparametrize $\kappa = \log c$. Then the full
conditional of $\kappa$ is $p(\kappa|\beta,\gamma,\sigma^2,t_n,Z) = p_c(\exp(\kappa)|\beta,\gamma,\sigma^2,t_n,Z)\cdot\exp(\kappa)$, where $p_c(\cdot|\beta,\gamma,\sigma^2,t_n,Z)$ denotes
the full conditional of $c$ specified as in (4.12). With $\kappa_{old}$ being the current value of $\kappa$, we generate $\kappa_{new}$
from $N(\kappa_{old}, \sigma_\kappa^2)$, i.e., a normal proposal, with $\sigma_\kappa^2$ fixed a priori. Then we accept $\kappa_{new}$ with probability

$$\min\{1,\ p(\kappa_{new}|\beta,\gamma,\sigma^2,t_n,Z)/p(\kappa_{old}|\beta,\gamma,\sigma^2,t_n,Z)\}.$$
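The Metropolis-Hastings update on $\kappa = \log c$ can be sketched as follows. The sketch assumes the generalized hyper-g density (3.7) and the full conditional (4.12), works on the log scale to avoid overflow when $a_n$ is very large, and uses the usual acceptance probability $\min\{1, \text{ratio}\}$ for a symmetric proposal. Function and argument names are illustrative only.

```python
import math
import random

def log_target_kappa(kappa, beta_sq_norm, model_size, sigma2, a_n, b):
    """log p(kappa | ...) up to a constant: log p_c(exp(kappa)) + kappa."""
    # log(1 + exp(kappa)) computed stably for both signs of kappa
    if kappa > 0:
        log1p_c = kappa + math.log1p(math.exp(-kappa))
    else:
        log1p_c = math.log1p(math.exp(kappa))
    log_lik = -0.5 * model_size * kappa - beta_sq_norm * math.exp(-kappa) / (2.0 * sigma2)
    log_prior = (a_n - 1) * kappa - (a_n + b) * log1p_c
    return log_lik + log_prior + kappa  # last term: Jacobian of c = exp(kappa)

def mh_step(kappa_old, step_sd, rng=random, **kw):
    """One random-walk update; returns (new value, accepted flag)."""
    kappa_new = rng.gauss(kappa_old, step_sd)
    log_ratio = log_target_kappa(kappa_new, **kw) - log_target_kappa(kappa_old, **kw)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return kappa_new, True
    return kappa_old, False
```

The proposal standard deviation `step_sd` plays the role of the fixed $\sigma_\kappa$ and would typically be tuned by monitoring the acceptance rate.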

5 Simulation study 

In this section, a simulation study is conducted to compare the performance of different methods. In Example
5.1 we compare our approach based on the generalized Zellner-Siow (GZS) and generalized hyper-g
(GHG) priors with several popular Bayesian methods. Specifically, we examined the posterior probability
of the true model using different approaches. We also looked at the FCR and length of the simultaneous
credible intervals constructed using the GZS and GHG priors. In Example 5.2 we compare our approach
with SIS-SCAD and ISIS-SCAD considered by [14]. The median size of the selected models and median
estimation error are reported.

5.1 Example 1 

For the first simulation, the data were generated from $Y = X\beta + \epsilon$ with $\epsilon \sim N(0, \sigma^2 I_n)$. The entries
$X_{ij}$ of $X$ are standard normal with the correlation between $X_{ij_1}$ and $X_{ij_2}$ being $\rho^{|j_1-j_2|}$, i.e., the AR(1)
model. To better examine the performance, we considered a variety of situations: $\sigma^2 = 1, 2$, $(n, p, s_n) =
(100, 15, 2), (200, 15, 2), (100, 1000, 10), (200, 1000, 10)$, and $\rho = 0, 0.5$. The choice of $\rho$ represents
independence and relatively higher correlation among the predictors. Note that in these situations RIC and the
benchmark prior method by [17] coincide with each other, so we only considered RIC. The true model
coefficient is $\beta^0 = (u^+_{s_n/2}, u^-_{s_n/2}, 0_{p-s_n})^T$ for $s_n = 2, 10$, where $0_{p-s_n}$ is the $(p - s_n)$-dimensional zero vector, and $u^+_{s_n/2}$
($u^-_{s_n/2}$) is the $(s_n/2)$-dimensional vector with components uniformly generated from $[1, 5]$ ($[-5, -1]$).
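The data-generating design of this example can be reproduced with a short Python sketch. It assumes that the AR(1) correlation $\rho^{|j_1-j_2|}$ is obtained through the usual recursion along the columns of each row, which yields unit variances and the stated correlations; the function name and return layout are illustrative.

```python
import math
import random

def simulate_example1(n, p, s_n, rho, sigma2, rng=random):
    """Generate (X, Y, beta0) with AR(1)-correlated standard normal rows."""
    root = math.sqrt(1.0 - rho * rho)
    X = []
    for _row in range(n):
        row = [rng.gauss(0.0, 1.0)]
        for _col in range(p - 1):
            # AR(1) recursion: unit variance, corr(X_j, X_k) = rho ** |j - k|
            row.append(rho * row[-1] + root * rng.gauss(0.0, 1.0))
        X.append(row)
    half = s_n // 2
    beta0 = ([rng.uniform(1.0, 5.0) for _ in range(half)]
             + [rng.uniform(-5.0, -1.0) for _ in range(half)]
             + [0.0] * (p - 2 * half))
    sd = math.sqrt(sigma2)
    Y = [sum(x * b for x, b in zip(row, beta0)) + rng.gauss(0.0, sd) for row in X]
    return X, Y, beta0
```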

We fixed $\nu = 6$ in (2.6) somewhat arbitrarily, though we found other choices also performing well. The
prior on $t_n$ was set to be $p(t_n) \propto I(t_n \le n/2)$, a commonly acceptable prior sparsity assumption in many
high-dimensional problems. For GZS defined as in (3.6), we chose $a = 0$ and $b_n = d$; for GHG defined as
in (3.7), we chose $a_n = p^d + 1$ and $b = 0$. To examine sensitivity, we considered $d = 2.8, 3, 3.2$, and denote
the corresponding GZS and GHG priors as GZS2.8, GZS3, GZS3.2 and GHG2.8, GHG3, GHG3.2. Our
study relied on $N = 100$ replicated data sets $Z^{(v)} = (Y^{(v)}, X^{(v)})$ for $v = 1,\dots,N$. Based on each data set $Z^{(v)}$, we
generated 10000 samples from the posterior distribution under each of the above mentioned approaches
in a variety of settings. The first 5000 samples served as burn-in, and the second half were used for
computation. It takes about 440.30 seconds to generate 10000 posterior samples when $p = 1000$ using
parallel computing techniques on a computer with 16 CPUs and 256 GB of memory. Convergence of the
Markov chains was monitored by Gelman-Rubin's statistics; see [18].

The results contain two parts. First, we examined the empirical posterior probability of the true model using
BIC, RIC, Zellner-Siow (ZS), hyper-g (HG) and hyper-g/n (HGN) priors that were considered in [29], and
GZS, GHG introduced in Section 3.4. The empirical proportion of the true model, denoted as $\hat{p}(\gamma^0|Z^{(v)})$, is
an estimate of $p(\gamma^0|Z^{(v)})$. For $\eta \in (0, 1)$, define $F(\eta) = \#\{1 \le v \le N \mid \hat{p}(\gamma^0|Z^{(v)}) > \eta\}/N$. That is, $1 - F(\eta)$
is the empirical distribution function of the $\hat{p}(\gamma^0|Z^{(v)})$'s. Since $\hat{p}(\gamma^0|Z^{(v)}) > 0.5$ implies that the true model is
selected, $F(0.5)$ measures the selection accuracy. To further examine how decisively the true model is
selected, we also looked at $F(0.9)$, i.e., the empirical proportion of $\hat{p}(\gamma^0|Z^{(v)})$ greater than 0.9. For each
of the above mentioned situations, we examined $F(\eta)$ for $\eta = 0.5, 0.9$. A larger value of $F(\eta)$
indicates better performance.
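The two summary quantities used here are simple to compute from the MCMC output. The sketch below assumes that the post-burn-in draws of $\gamma$ for one data set are available as a list of 0/1 vectors and that $F(\eta)$ uses a strict inequality; both function names are hypothetical.

```python
def empirical_posterior_prob(gamma_samples, gamma_true):
    """Proportion of post-burn-in draws that equal the true model."""
    target = tuple(gamma_true)
    hits = sum(1 for g in gamma_samples if tuple(g) == target)
    return hits / len(gamma_samples)

def F(eta, p_hats):
    """F(eta) = #{v : p_hat_v > eta} / N over the N replicated data sets."""
    return sum(1 for q in p_hats if q > eta) / len(p_hats)
```

For example, with estimated probabilities 0.95, 0.6, 0.3 and 0.92 over four replicates, $F(0.5) = 0.75$ and $F(0.9) = 0.5$.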

Our empirical finding (based on the R package BAS provided by www.stat.duke.edu/~clyde/BAS) reveals
that the value of the hyperparameter in HG and HGN recommended by [29] cannot yield a high value (close
to 1) of $\hat{p}(\gamma^0|Z^{(v)})$, though correct model selection can still be achieved since it was found to be greater than
$\hat{p}(\gamma|Z^{(v)})$ for any $\gamma \ne \gamma^0$. For this reason, we chose the hyperparameter in HG and HGN to be 0.1 to achieve
higher values of $\hat{p}(\gamma^0|Z^{(v)})$ (see Table 1). The code was written in Matlab and is available upon request.

Table 1 summarizes the values of $F(0.5)$ and $F(0.9)$. We found that all the approaches demonstrate
satisfactory performance when $(p, s_n) = (15, 2)$. With $\sigma^2$ and $\rho$ increasing, the selection performance is slightly
affected but overall is accurate enough. When $(p, s_n) = (1000, 10)$, BIC, HG, ZS and HGN cannot select
the correct model, while GZS and GHG can still accurately select the true model. The worst situation is
$\sigma^2 = 2$, $\rho = 0.5$, in which the values of $F(0.5)$ all decrease to 0.80-0.90. Somewhat surprisingly, RIC can still achieve
values of $F(0.5)$ up to 0.70 when $n = 100$, and even up to 0.90 when $n = 200$. This is because RIC
fixes $c = p^2$, a large number that yields more accurate selection. However, it cannot give positive values of
$F(0.9)$, so the true model is not selected decisively. In contrast, both GZS and GHG can give values
of $F(0.9)$ over 0.80 when $n = 100$, and even over 0.90 when $n = 200$. The results also demonstrate that
the selection accuracy of GZS and GHG is not very sensitive to $d \in [2.8, 3.2]$ in all of the situations.

Second, we computed the FCR and the length of the 95% credible intervals for the selected coefficients
(based on the highest posterior probability model), when GZS and GHG with $d = 2.8, 3, 3.2$ were used.
The 95% credible intervals were constructed using the formula (3.8). The posterior estimates of $c$ and $\sigma^2$,
obtained through posterior averages of the $c$ and $\sigma^2$ chains, were plugged in to obtain the intervals. We
should point out that the credible intervals, together with the empirical posterior probability of the true
model, were jointly obtained through the posterior samples. In other words, model selection and credible
interval construction were jointly achieved, reflecting the "one-step" feature of the method. Based on the
100 replicated data sets, the FCR was calculated as the mean false coverage proportion, and the average
length was taken as the mean length of the intervals for the selected coefficients.
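A possible way to compute these two summaries from the replicated data sets is sketched below. The input layout (one list of `(lower, upper, true_beta)` triples per data set) and the decision to skip data sets with no selected coefficient are assumptions made for the illustration, as is averaging lengths within each data set before averaging across data sets.

```python
def fcr_and_length(replicates):
    """Return (mean false-coverage proportion, mean interval length).

    replicates: one list per data set of (lower, upper, true_beta) triples,
    one triple per selected coefficient.
    """
    props, lengths = [], []
    for intervals in replicates:
        if not intervals:
            continue  # assumption: skip data sets with an empty selected model
        misses = sum(1 for lo, hi, b in intervals if not (lo <= b <= hi))
        props.append(misses / len(intervals))
        lengths.append(sum(hi - lo for lo, hi, _ in intervals) / len(intervals))
    return sum(props) / len(props), sum(lengths) / len(lengths)
```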
                             n = 100                     n = 200
(p, s_n) =           (15, 2)     (1000, 10)      (15, 2)     (1000, 10)
σ²  ρ    Method   F(0.5) F(0.9) F(0.5) F(0.9) F(0.5) F(0.9) F(0.5) F(0.9)
1   0    BIC        0.94      -      -      -      1   0.31      -      -
         RIC        0.97   0.05   0.73      -   0.99   0.38   0.98      -
         ZS         0.85      -      -      -   0.98      -      -      -
         HG         0.96   0.56      -      -   0.96   0.71      -      -
         HGN        0.94   0.53      -      -   0.98   0.82      -      -
         GZS2.8     0.99   0.82   0.99   0.79      1   0.99      1   0.92
         GHG2.8     0.99   0.82   0.96   0.78      1   0.97      1   0.98
         GZS3       0.99   0.90   0.99   0.89      1   0.99      1   0.97
         GHG3       0.98   0.90      1   0.94      1   0.96      1   0.97
         GZS3.2        1   0.86      1   0.93      1   0.95      1      1
         GHG3.2     0.97   0.91   0.96   0.90      1   0.93      1   0.99
    0.5  BIC        0.90      -      -      -   0.97   0.24      -      -
         RIC        0.94   0.03   0.70      -   0.97   0.31   0.96      -
         ZS         0.75      -      -      -   0.95      -      -      -
         HG         0.95   0.49      -      -   0.94   0.68      -      -
         HGN        0.92   0.56      -      -   0.98   0.84      -      -
         GZS2.8     0.98   0.80   0.95   0.70          0.90      1   0.91
         GHG2.8     0.98   0.82   0.96   0.78   0.98   0.89      1   0.95
         GZS3          1   0.91   0.97   0.86      1   0.95      1   0.97
         GHG3       0.99   0.89   0.94   0.87      1   0.92      1   0.97
         GZS3.2     0.98   0.87   0.95   0.91      1   0.97      1   0.99
         GHG3.2        1   0.93   0.95   0.92      1   0.98      1   0.98
2   0    BIC        0.93      -      -      -   0.98   0.23      -      -
         RIC        0.95   0.02   0.74      -      1   0.34   0.96      -
         ZS         0.83      -      -      -   0.97      -      -      -
         HG         0.92   0.36      -      -   0.97   0.57      -      -
         HGN        0.84   0.38      -      -   0.90   0.50      -      -
         GZS2.8        1   0.83   0.93   0.72      1   0.90      1   0.97
         GHG2.8     0.97   0.76   0.98   0.83      1   0.90   0.98   0.93
         GZS3       0.99   0.93   0.93   0.82      1   0.95      1   0.95
         GHG3          1   0.82   0.98   0.88      1   0.94      1   0.97
         GZS3.2     0.99   0.92   0.96   0.92      1   0.93      1      1
         GHG3.2        1   0.94   0.95   0.90      1   0.95      1   0.98
    0.5  BIC        0.90      -      -      -   0.94   0.23      -      -
         RIC        0.93   0.01   0.64      -   0.95   0.30   0.90      -
         ZS         0.80      -      -      -   0.94      -      -      -
         HG         0.83   0.37      -      -   0.95   0.51      -      -
         HGN        0.81   0.34      -      -   0.98   0.50      -      -
         GZS2.8     0.98   0.82   0.90   0.65   0.98   0.91   0.98   0.90
         GHG2.8     0.99   0.86   0.88   0.58      1   0.90   0.97   0.92
         GZS3       0.98   0.87   0.85   0.68      1   0.91   0.93   0.91
         GHG3       0.99   0.88   0.87   0.79      1   0.90   0.99   0.91
         GZS3.2        1   0.88   0.89   0.71      1   0.94   0.93   0.90
         GHG3.2     0.97   0.92   0.84   0.74      1   0.98   0.96   0.90

Table 1: Values of $F(\eta)$ for $\eta = 0.5, 0.9$ in various settings. "-" indicates a zero-value.
Table 2 summarizes the results. We observed that the FCRs are all controlled around 5% except for $(\sigma^2, \rho) =
(2, 0.5)$. This is consistent with the finding by [53], who showed that the FCR of the simultaneous credible
intervals can be controlled by the nominal level for constructing the intervals when the signal-to-noise ratio
is reasonably large. When $(\sigma^2, \rho) = (2, 0.5)$, the FCR tends to be around 10%, reflecting the effect of higher
correlation and model error. As $n$ increases, or $\rho$ and $\sigma^2$ decrease, the average lengths of the credible
intervals for the selected coefficients become shorter. The results also reveal that the performance of the
simultaneous credible intervals is not very sensitive to the choice of $d \in [2.8, 3.2]$ in GZS and GHG,
at least in this simulation.







                              n = 100                       n = 200
(p, s_n) =               (15, 2)     (1000, 10)        (15, 2)     (1000, 10)
σ²  ρ    Method
1   0    GZS2.8     5.50 (38.93)   7.40 (38.58)   5.17 (27.30)   5.90 (27.29)
         GHG2.8     5.67 (38.98)   7.55 (37.82)   3.17 (27.72)   6.49 (27.44)
         GZS3       5.50 (38.94)   7.40 (38.67)   5.17 (27.30)   5.90 (27.31)
         GHG3       5.67 (39.00)   7.37 (37.97)   3.17 (27.73)   6.59 (27.45)
         GZS3.2     5.50 (38.96)   7.40 (38.72)   6.50 (27.30)   5.90 (27.29)
         GHG3.2     5.67 (39.00)   7.27 (38.07)   3.17 (27.73)   6.50 (27.46)
    0.5  GZS2.8     3.00 (44.58)   7.04 (47.62)   4.00 (32.18)   4.80 (35.10)
         GHG2.8     6.83 (44.59)   7.33 (47.10)   5.50 (31.73)   6.80 (35.01)
         GZS3       3.00 (44.60)   6.49 (47.67)   4.00 (32.18)   4.80 (35.13)
         GHG3       6.83 (44.61)   7.05 (47.14)   5.50 (31.75)   6.80 (35.02)
         GZS3.2     3.00 (44.62)   6.49 (47.79)   3.50 (32.19)   4.80 (35.17)
         GHG3.2     6.83 (44.61)   7.05 (47.29)   5.50 (31.75)   6.80 (35.02)
2   0    GZS2.8     6.50 (54.61)   6.00 (54.52)   4.00 (38.89)   6.40 (39.66)
         GHG2.8     5.17 (54.26)   7.37 (56.52)   6.50 (38.36)   5.99 (40.17)
         GZS3       6.50 (54.62)   6.22 (54.71)   4.00 (38.89)   6.50 (39.68)
         GHG3       4.83 (54.27)   7.11 (56.72)   6.50 (38.36)   5.90 (40.19)
         GZS3.2     6.50 (54.64)   6.22 (54.91)   4.00 (38.90)   6.40 (39.68)
         GHG3.2     4.83 (54.28)   7.01 (56.91)   6.50 (38.36)   5.90 (40.21)
    0.5  GZS2.8     6.00 (63.07)   8.59 (65.31)   6.83 (45.30)   5.60 (48.37)
         GHG2.8     4.50 (62.33)   8.68 (68.33)   4.50 (45.09)   6.10 (49.16)
         GZS3       6.00 (63.08)   9.09 (65.42)   6.83 (45.30)   5.50 (48.37)
         GHG3       4.00 (62.35)   9.96 (68.72)   4.50 (45.09)   5.92 (49.16)
         GZS3.2     5.50 (63.10)   9.19 (65.71)   6.83 (45.30)   5.60 (48.35)
         GHG3.2     4.50 (62.35)   9.90 (68.84)   4.50 (45.09)   5.92 (49.16)

Table 2: 100×FCR (100×average length) of the 95% credible intervals for the selected coefficients constructed by GZS and GHG
in various settings.

5.2 Example 2 

In our second study, we adopted two simulation settings in [14]. In Setting I, $N = 200$ data sets were
generated from $Y = X\beta + \epsilon$ with $\epsilon \sim N(0, 1.5 I_n)$, where $X$ is $n \times p$ containing i.i.d. standard Gaussian
entries. We considered $(n, p, s_n) = (200, 1000, 8)$ and $(800, 20000, 18)$, where recall $s_n$ represents the size
of the true model. In each data replication, the $s_n$ nonzero coefficients were chosen to be $(-1)^u(a + |z|)$,
where $u$ was drawn from the Bernoulli distribution with parameter 0.4, $z$ was drawn from the standard Gaussian
distribution, and $a = 4\log n/\sqrt{n}$ and $5\log n/\sqrt{n}$ corresponding to the two situations. In [14], the median
size of the selected models and the median of $\|\hat{\beta} - \beta^0\|$ obtained from SIS-SCAD and ISIS-SCAD were
reported. In Bayesian approaches, we also looked at the median size of the selected models with the highest
posterior probability, and the median of $\|\hat{\beta} - \beta^0\|$, where $\hat{\beta}$ was found by the posterior mean of the $\beta$ samples.
To demonstrate how stable the posterior estimate is, we also looked at the standard deviations of the $\|\hat{\beta} - \beta^0\|$'s.
We fixed $\nu = 6$ in (2.6). The prior on $t_n$ was set to be $p(t_n) \propto I(t_n \le n/2)$. Due to computational cost,
we generated Markov chains with length 4000 and 1000 for $(n, p, s_n) = (200, 1000, 8)$ and $(800, 20000, 18)$
respectively. Using Gelman-Rubin's statistics, we found that the Markov chains appear to mix well.
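The coefficient-generating rule of Setting I and the error measure can be sketched in Python as follows. The sketch assumes that the $s_n$ signals occupy the first $s_n$ positions, which the text does not specify, and the argument `multiplier` stands for the constant 4 or 5 in $a = 4\log n/\sqrt{n}$ or $5\log n/\sqrt{n}$; both function names are hypothetical.

```python
import math
import random

def setting1_coefficients(n, p, s_n, multiplier, rng=random):
    """Nonzero entries (-1)**u * (a + |z|) with a = multiplier * log(n) / sqrt(n)."""
    a = multiplier * math.log(n) / math.sqrt(n)
    beta0 = [0.0] * p
    for j in range(s_n):  # assumption: signals sit in the first s_n slots
        u = 1 if rng.random() < 0.4 else 0   # u ~ Bernoulli(0.4)
        z = rng.gauss(0.0, 1.0)              # z ~ N(0, 1)
        beta0[j] = (-1) ** u * (a + abs(z))
    return beta0

def l2_error(beta_hat, beta0):
    """Euclidean estimation error between an estimate and the truth."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(beta_hat, beta0)))
```

By construction every nonzero coefficient has magnitude at least $a$, so the minimal signal strength shrinks at the rate $\log n/\sqrt{n}$.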

In Table 3 we compared the median size of the selected models (MSSM) and the median of the error
$\|\hat{\beta} - \beta^0\|$ (ME) obtained from SIS-SCAD, ISIS-SCAD, and the proposed Bayesian method with GZS3 and
GHG3 priors, in Setting I. The performance of GZS and GHG priors with $d = 2.8$ and $3.2$ is similar, and thus
was not reported. Results based on SIS-SCAD and ISIS-SCAD were summarized from [14]. We observed
that all four methods yield satisfactory accuracy in coefficient estimation, and GZS3 and GHG3 perform
slightly better in yielding the correct model size. The standard deviations of $\|\hat{\beta} - \beta^0\|$ using both GZS3 and
GHG3 priors are around 0.08 and 0.04 (for $p = 1000, 20000$), reflecting the stability of the two approaches.





(n, p, s_n)          SIS-SCAD      ISIS-SCAD     GZS3           GHG3
(200, 1000, 8)       15 (0.374)    13 (0.329)    8 (0.2811)     8 (0.2806)
                                                 (0.0784)       (0.0783)
(800, 20000, 18)     37 (0.288)    31 (0.246)    18 (0.2252)    18 (0.2257)
                                                 (0.0329)       (0.0360)

Table 3: MSSM and ME based on SIS-SCAD, ISIS-SCAD, GZS3 and GHG3 for Setting I. For SIS-SCAD and ISIS-SCAD, the
numbers in the parentheses represent the MEs. For GZS3 and GHG3, the numbers in the parentheses represent MEs (upper) and
standard deviations of $\|\hat{\beta} - \beta^0\|$ (lower).

In Setting II, $N = 200$ data sets were generated from $Y = X\beta + \epsilon$ with $\epsilon \sim N(0, \sigma^2 I_n)$. We considered three
situations $(n, p, s_n) = (200, 1000, 5), (200, 1000, 8), (800, 20000, 14)$. Correspondingly, we chose $(\sigma, a) =
(1, 2\log n/\sqrt{n}), (1.5, 4\log n/\sqrt{n}), (2, 4\log n/\sqrt{n})$. The true coefficient vector $\beta^0$ was generated using the
same strategy described in Setting I. The major difference in Setting II lies in generating the $X$ matrix.
Explicitly, the $s_n$ predictors $X_1,\dots,X_{s_n}$ were generated from $N(0, A)$ for some positive definite covariance
matrix $A$ with condition number $\sqrt{n}/\log n$. The procedure for producing $A$ was described in [14]. Then we
drew $W_{s_n+1},\dots,W_p$ from $N(0, I_{p-s_n})$, set $X_j = W_j + rX_{j-s_n}$ for $j = s_n+1,\dots,2s_n$, and $X_j = W_j + (1-r)X_1$
for $j = 2s_n+1,\dots,p$. Here $r = 1 - 4\log n/p$, $1 - 5\log n/p$, $1 - 5\log n/p$ for the three situations. We still
fixed $\nu = 6$ in (2.6). The prior on $t_n$ was set to be $p(t_n) \propto I(t_n \le n/2)$. The Markov chains have length 4000
and 1000 for $p = 1000$ and $20000$ respectively. The chains appear to converge based on Gelman-Rubin's
statistics. In Table 4 the MSSMs and the MEs of $\|\hat{\beta} - \beta^0\|$ obtained from the four methods in Setting II are
summarized. Although the covariates now have a certain dependence structure, all four methods
still perform well. In particular, GZS3 and GHG3 yield more satisfactory selection and estimation accuracy,
and produce stable results.



(n, p, s_n)          SIS-SCAD      ISIS-SCAD      GZS3           GHG3
(200, 1000, 5)       21 (0.331)    11 (0.223)     5 (0.1570)     5 (0.1559)
                                                  (0.0478)       (0.0477)
(200, 1000, 8)       18 (0.458)    13.5 (0.366)   8 (0.2947)     8 (0.2959)
                                                  (0.0732)       (0.0731)
(800, 20000, 14)     36 (0.367)    27 (0.315)     14 (0.2633)    14 (0.2631)
                                                  (0.0543)       (0.0466)

Table 4: MSSM and ME based on SIS-SCAD, ISIS-SCAD, GZS3 and GHG3 for Setting II. For SIS-SCAD and ISIS-SCAD, the
numbers in the parentheses represent the MEs. For GZS3 and GHG3, the numbers in the parentheses represent MEs (upper) and
standard deviations of $\|\hat{\beta} - \beta^0\|$ (lower).

6 Conclusions 

We examined posterior consistency of a fully Bayesian method in sparse high-dimensional settings. As
revealed in our main results, the prior (2.5) plays an important role. This prior plays the same role as
a dimension reduction step. The difference is that, unlike other methods in which dimension reduction is a
separate step, with (2.5) dimension reduction is carried out automatically and stochastically in the process of
Bayesian model fitting and MCMC search, and thus all the statistical procedures are conducted in a unified
framework. This "one-step" fashion distinguishes our method from the existing ones.

Tables 1 and 2 demonstrate the numerical performance of the proposed method. Overall, the performance
is not very sensitive to the choice of the hyperparameter $d$ in GZS and GHG. In practice, we recommend using
$d = 3$ which, at least in our simulation settings, displays satisfactory accuracy. Other choices close to it yield
similar results.

Two extensions of our method to other scopes are worth mentioning. The first is the high-dimensional
Gaussian graphical model, in which the goal is to find the associated genes by estimating the sparse
precision matrix. It is well known that this problem can be solved by a Bayesian model selection approach
in a completely different setting; see (H and the references therein. It may be possible to apply a prior
similar to (2.5) to control the number of genes during model fitting and to conduct a stochastic search to find
the associated genes.

The second direction that we intend to explore is whether our approach can be extended to generalized
linear models with high dimensionality. Ideally, a fully Bayesian framework endowed with MCMC could
simplify the selection procedure and, at the same time, conduct estimation and inference over the selected
variables. It remains open whether such computing methods can be proposed in a more general modeling
framework. It is well known that in generalized linear models the posterior distribution of the model does
not have a closed form. A common method is to apply the Laplace approximation; see, e.g., B71 . However,
as pointed out by BT1, the approximation error cannot be easily controlled in higher dimensional
settings. An alternative might be to first show uniform convergence of the posterior probability with
certain hyperparameters fixed, as in Theorems 3.2-3.5, and then generalize this to broader situations where the
posterior probability can be expressed as an intractable integral; see, e.g., Section 3.3.



APPENDIX: Proofs 



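To prove Theorem 3.2 we need the following preliminary lemma.

Lemma 1 Suppose $\epsilon \sim N(0, \sigma_0^2 I_n)$ and recall the true model is $Y = X\beta^0 + \epsilon$.

(i). Let $v_\gamma$ be an $n$-dimensional vector indexed by $\gamma \in S$, a subset of the model space. Adopt the convention
that $v_\gamma^T\epsilon/\|v_\gamma\| = 0$ when $v_\gamma = 0$. Let $\#S$ denote the cardinality of $S$ with $\#S \ge 2$. Then

$$\max_{\gamma\in S}\frac{|v_\gamma^T\epsilon|}{\|v_\gamma\|} = O_P\big(\sqrt{\log(\#S)}\big). \qquad (6.1)$$

In particular, letting $v_\gamma = (I_n - P_\gamma)X_{\gamma^0\setminus\gamma}\beta^0_{\gamma^0\setminus\gamma}$ for $\gamma \in S_2(t_n)$, we have

$$\max_{s_n\le t_n\le r_n}\ \max_{\gamma\in S_2(t_n)}\frac{|v_\gamma^T\epsilon|}{\|v_\gamma\|} = O_P\big(\sqrt{r_n\log p}\big). \qquad (6.2)$$

(ii). For any fixed $\alpha > 2$,

$$\lim_{n\to\infty}P\Big(\max_{s_n\le t_n\le r_n}\ \max_{\gamma\in S_1(t_n)}\frac{\epsilon^T(P_\gamma - P_{\gamma^0})\epsilon}{|\gamma| - s_n} \le \alpha\sigma_0^2\log p\Big) = 1.$$

(iii). Adopt the convention that $\epsilon^TP_\gamma\epsilon/|\gamma| = 0$ when $\gamma$ is null. Then for any fixed $\alpha > 2$,

$$\lim_{n\to\infty}P\Big(\max_{t_n\in[s_n,r_n]}\ \max_{\gamma\in S_2(t_n)}\frac{\epsilon^TP_\gamma\epsilon}{|\gamma|} \le \alpha\sigma_0^2\log p\Big) = 1.$$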



Proof of Lemma 1

The proof of (ii) and (iii) is a trivial modification of Lemma A.1 in [39]. Next we only show (i). For any
$v_\gamma \ne 0$, $v_\gamma^T\epsilon/(\sigma_0\|v_\gamma\|) \sim N(0, 1)$. By (9.3) of lEl, if $\xi \sim N(0, 1)$, then $P(|\xi| \ge t) \le C_0\exp(-t^2/2)$ for some positive
constant $C_0$. Therefore,

$$P\Big(\max_{\gamma\in S}\frac{|v_\gamma^T\epsilon|}{\|v_\gamma\|} \ge \sigma_0C\sqrt{\log(\#S)}\Big) \le \sum_{\gamma\in S}P\Big(\frac{|v_\gamma^T\epsilon|}{\|v_\gamma\|} \ge \sigma_0C\sqrt{\log(\#S)}\Big) \le C_0\,\#S\cdot(\#S)^{-C^2/2} = C_0(\#S)^{1-C^2/2},$$

which is small when $C > 0$ is chosen sufficiently large. This shows (6.1). To show (6.2), consider
$S = \bigcup_{s_n\le t_n\le r_n}S_2(t_n)$. Clearly $S \subset S_2(r_n)$. Note $\#S_2(r_n) \le \sum_{j\le r_n}\binom{p}{j} \le \sum_{j\le r_n}p^j/j! \le p^{r_n}$. Thus we have
$\#S \le p^{r_n}$. Plugging this into (6.1), we get (6.2).
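
The union-bound argument for (6.1) can be checked numerically with a small Monte Carlo sketch. Everything below is illustrative and not part of the paper: the vectors $v_\gamma$ are replaced by arbitrary random directions, the function names are hypothetical, and the sizes are kept small so that the pure-Python loop runs quickly.

```python
import math
import random


def max_normalized_projection(unit_vectors, eps):
    """max over gamma of |v_gamma^T eps| / ||v_gamma|| for pre-normalized vectors."""
    return max(abs(sum(a * b for a, b in zip(v, eps))) for v in unit_vectors)


def exceed_frequency(m, n, sigma0, c, reps, seed=0):
    """Monte Carlo frequency of the event
    max_gamma |v_gamma^T eps| / ||v_gamma|| >= sigma0 * c * sqrt(log m),
    with m arbitrary directions in R^n standing in for the v_gamma."""
    rng = random.Random(seed)
    unit_vectors = []
    for _ in range(m):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        unit_vectors.append([x / norm for x in v])
    threshold = sigma0 * c * math.sqrt(math.log(m))
    hits = 0
    for _ in range(reps):
        eps = [rng.gauss(0.0, sigma0) for _ in range(n)]
        if max_normalized_projection(unit_vectors, eps) >= threshold:
            hits += 1
    return hits / reps


# A large constant C makes the event rare, a small one makes it common.
print(exceed_frequency(m=100, n=20, sigma0=1.5, c=3.0, reps=100))
print(exceed_frequency(m=100, n=20, sigma0=1.5, c=0.5, reps=100))
```

The first frequency should be essentially zero and the second close to one, which is the behavior the bound $C_0(\#S)^{1-C^2/2}$ describes as $C$ grows.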



Proof of Theorem 3.2

The idea of the proof is to derive explicit upper bounds (uniform over the variance-control parameters $c_j$'s)
for the ratio $p(\gamma|Z)/p(\gamma^0|Z)$, where $\gamma \ne \gamma^0$. By showing that the sum of these upper bounds converges to zero, and
using the trivial fact

$$p(\gamma^0|Z) = \frac{1}{1+\sum_{\gamma\ne\gamma^0}\dfrac{p(\gamma|Z)}{p(\gamma^0|Z)}},$$

we will conclude $p(\gamma^0|Z) \to 1$. Throughout the proofs of our theoretical results, we use the shorthand "w.l.p."
for "with large probability". For any $s_n \le t_n \le r_n$, we consider the following decomposition for $\gamma \in S_1(t_n)\cup S_2(t_n)$:

$$\begin{aligned}
-\log\frac{p(\gamma|Z)}{p(\gamma^0|Z)} = {} & -\log\frac{p(\gamma)}{p(\gamma^0)} + \frac12\log\frac{\det(W_\gamma)}{\det(W_{\gamma^0})} \\
& + \frac{n+\nu}{2}\log\frac{1+Y^T(I_n - X_\gamma U_\gamma^{-1}X_\gamma^T)Y}{1+Y^T(I_n - P_\gamma)Y} \\
& - \frac{n+\nu}{2}\log\frac{1+Y^T(I_n - X_{\gamma^0}U_{\gamma^0}^{-1}X_{\gamma^0}^T)Y}{1+Y^T(I_n - P_{\gamma^0})Y} \\
& + \frac{n+\nu}{2}\log\frac{1+Y^T(I_n - P_\gamma)Y}{1+Y^T(I_n - P_{\gamma^0})Y}.
\end{aligned}$$

Denote the above five terms by $I_1, I_2, I_3, I_4, I_5$. Next we approximate these terms respectively.

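By Assumption A.1, $I_1$ is bounded from below. Since $U_\gamma \ge X_\gamma^TX_\gamma$, $I_3 \ge 0$. By the assumptions $k_n = O(\underline{\phi}_n)$ and
$n\underline{\phi}_n \to \infty$, and the proof of Theorem 2.2 in [39], $0 \le -I_4 = O_P(1)$. Next we approximate $I_5$. For $\gamma \in S_2(t_n)$,
let $v_\gamma = (I_n - P_\gamma)X_{\gamma^0\setminus\gamma}\beta^0_{\gamma^0\setminus\gamma}$. Note Assumption A.3 (iii) implies $r_n\log p = o(n\psi_n^2)$. By Assumption A.1 it
can be shown that

$$\|v_\gamma\|^2 = (\beta^0_{\gamma^0\setminus\gamma})^TX_{\gamma^0\setminus\gamma}^T(I_n - P_\gamma)X_{\gamma^0\setminus\gamma}\beta^0_{\gamma^0\setminus\gamma} \ge nc_0^{-1}\|\beta^0_{\gamma^0\setminus\gamma}\|^2 \ge nc_0^{-1}\psi_n^2.$$

Note $(I_n - P_\gamma)X_{\gamma^0} = (0, (I_n - P_\gamma)X_{\gamma^0\setminus\gamma})$. Then by Lemma 1 (i) and (iii) we have for some fixed $\alpha > 2$, w.l.p.,
for $s_n \le t_n \le r_n$ and uniformly over $\gamma \in S_2(t_n)$,

$$\begin{aligned}
Y^T(I_n - P_\gamma)Y &= \|v_\gamma\|^2 + 2v_\gamma^T\epsilon + \epsilon^T(I_n - P_\gamma)\epsilon \\
&\ge \|v_\gamma\|^2 - 4\sigma_0\|v_\gamma\|\sqrt{r_n\log p} + \epsilon^T\epsilon - \alpha\sigma_0^2|\gamma|\log p \\
&\ge \|v_\gamma\|^2\Big(1 - \frac{4\sigma_0\sqrt{r_n\log p}}{\|v_\gamma\|} - \frac{\alpha\sigma_0^2r_n\log p}{\|v_\gamma\|^2}\Big) + \epsilon^T\epsilon \\
&= \|v_\gamma\|^2(1+o(1)) + \epsilon^T\epsilon \\
&\ge nc_0^{-1}\psi_n^2(1+o(1)) + \epsilon^T\epsilon.
\end{aligned}$$

Since $s_n = o(n)$, $\epsilon^T(I_n - P_{\gamma^0})\epsilon = n\sigma_0^2(1+o_P(1))$. Therefore, there exists a constant $C' > 0$ (not depending
on $\gamma$) such that w.l.p., for $s_n \le t_n \le r_n$ and for any $\gamma \in S_2(t_n)$,

$$I_5 \ge \frac{n+\nu}{2}\log\frac{1+nc_0^{-1}\psi_n^2(1+o(1)) + \epsilon^T\epsilon}{1+\epsilon^T(I_n - P_{\gamma^0})\epsilon} \ge \frac{n+\nu}{2}\log(1+C'\psi_n^2). \qquad (6.3)$$

On the other hand, for any fixed $\alpha' > \alpha$, by properties of projection matrices and Lemma 1 (ii), we have,
w.l.p., for $s_n \le t_n \le r_n$ and uniformly for $\gamma \in S_1(t_n)$,

$$\begin{aligned}
\frac{1+Y^T(I_n - P_\gamma)Y}{1+Y^T(I_n - P_{\gamma^0})Y} &= 1 - \frac{Y^T(P_\gamma - P_{\gamma^0})Y}{1+Y^T(I_n - P_{\gamma^0})Y} \\
&= 1 - \frac{(\beta^0_{\gamma^0})^TX_{\gamma^0}^T(P_\gamma - P_{\gamma^0})X_{\gamma^0}\beta^0_{\gamma^0} + 2(\beta^0_{\gamma^0})^TX_{\gamma^0}^T(P_\gamma - P_{\gamma^0})\epsilon + \epsilon^T(P_\gamma - P_{\gamma^0})\epsilon}{1+Y^T(I_n - P_{\gamma^0})Y} \\
&= 1 - \frac{\epsilon^T(P_\gamma - P_{\gamma^0})\epsilon}{1+\epsilon^T(I_n - P_{\gamma^0})\epsilon} \\
&\ge 1 - \frac{\alpha\sigma_0^2(|\gamma| - s_n)\log p}{n\sigma_0^2(1+o_P(1))} \\
&\ge 1 - \frac{\alpha'(|\gamma| - s_n)\log p}{n}.
\end{aligned}$$

Recall the inequality $\log(1 - x) \ge -2x$ for $x \in (0, 1/2)$, and note that Assumption A.3 (iii)
implies that $(|\gamma| - s_n)\log p/n$ approaches zero uniformly for $\gamma \in S_1(t_n)$ with $s_n \le t_n \le r_n$. Therefore, for
large $n$, w.l.p., for $s_n \le t_n \le r_n$ and uniformly for $\gamma \in S_1(t_n)$,

$$I_5 \ge \frac{n+\nu}{2}\log\Big(1 - \frac{\alpha'(|\gamma| - s_n)\log p}{n}\Big) \ge -\alpha_0(|\gamma| - s_n)\log p, \qquad (6.4)$$

where $\alpha_0 = 2\alpha'$. It follows by Lemma A.2 in [39] that

$$I_2 \ge 2^{-1}(|\gamma| - s_n)\log(1+c_0^{-1}n\underline{\phi}_n), \quad \text{for } s_n \le t_n \le r_n \text{ and uniformly for } \gamma \in S_1(t_n), \text{ and} \qquad (6.5)$$

$$I_2 \ge -2^{-1}s_n\log(1+c_0n\bar{\phi}_n), \quad \text{for } s_n \le t_n \le r_n \text{ and uniformly for } \gamma \in S_2(t_n). \qquad (6.6)$$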

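By Assumption A.3 (v), $\log p = o(\log(1+c_0^{-1}n\underline{\phi}_n))$. Using (6.3)-(6.6), we have, w.l.p., for $s_n \le t_n \le r_n$ and
uniformly for $\underline{\phi}_n \le c_j \le \bar{\phi}_n$,

$$\frac{p(\gamma|Z)}{p(\gamma^0|Z)} \le \tilde C\Big(\frac{1+c_0^{-1}n\underline{\phi}_n}{p^{2\alpha_0}}\Big)^{-2^{-1}(|\gamma| - s_n)}, \quad \gamma \in S_1(t_n),\ s_n \le t_n \le r_n, \qquad (6.7)$$

and

$$\frac{p(\gamma|Z)}{p(\gamma^0|Z)} \le \tilde C\exp\Big(2^{-1}s_n\log(1+c_0n\bar{\phi}_n) - \frac{n+\nu}{2}\log(1+C'\psi_n^2)\Big) \le \tilde C(1+C'\psi_n^2)^{-\frac{n+\nu}{4}}, \quad \gamma \in S_2(t_n),\ s_n \le t_n \le r_n. \qquad (6.8)$$

It follows by (6.7), (6.8), and Assumption A.3 (iii) and (v) that, as $n \to \infty$,

$$\begin{aligned}
\sum_{\gamma\in S_1(t_n)}\frac{p(\gamma|Z)}{p(\gamma^0|Z)} &\le \tilde C\sum_{r=s_n+1}^{t_n}\binom{p - s_n}{r - s_n}\Big(\frac{1+c_0^{-1}n\underline{\phi}_n}{p^{2\alpha_0}}\Big)^{-2^{-1}(r - s_n)} \\
&\le \tilde C\sum_{r=1}^{t_n - s_n}\frac{p^r}{r!}\Big(\frac{1+c_0^{-1}n\underline{\phi}_n}{p^{2\alpha_0}}\Big)^{-2^{-1}r} \\
&= \tilde C\sum_{r=1}^{t_n - s_n}\frac{1}{r!}\Big(\frac{p^{2\alpha_0+2}}{1+c_0^{-1}n\underline{\phi}_n}\Big)^{r/2} \\
&\le \tilde C\Big(\exp\Big(\sqrt{p^{2\alpha_0+2}/(1+c_0^{-1}n\underline{\phi}_n)}\Big) - 1\Big) \to 0,
\end{aligned}$$

and

$$\sum_{\gamma\in S_2(t_n)}\frac{p(\gamma|Z)}{p(\gamma^0|Z)} \le \tilde C\,\#S_2(t_n)\cdot(1+C'\psi_n^2)^{-\frac{n+\nu}{4}} \le \tilde C\,p^{r_n}(1+C'\psi_n^2)^{-\frac{n+\nu}{4}} \to 0.$$

Note the above convergence holds in probability and is uniform for $\underline{\phi}_n \le c_j \le \bar{\phi}_n$ and $s_n \le t_n \le r_n$. As a
consequence,

$$\min_{s_n\le t_n\le r_n}\ \inf_{\underline{\phi}_n\le c_j\le\bar{\phi}_n}p(\gamma^0|Z) \to 1 \quad \text{in probability.}$$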
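
The closing step of this proof, passing from bounds on the individual ratios to $p(\gamma^0|Z) \to 1$, can be sketched numerically. The snippet below is a toy illustration, not the paper's computation: it assumes each model with $k$ superfluous covariates has posterior ratio at most $q^k$ for some small $q$ (a stand-in for the bound (6.7)), counts the $\binom{p-s_n}{k}$ such models, and compares the total with the exponential bound used in the proof. All function names and numbers are hypothetical.

```python
import math


def posterior_of_true_model(ratio_sum):
    """p(gamma0 | Z) = 1 / (1 + sum over gamma != gamma0 of p(gamma|Z) / p(gamma0|Z))."""
    return 1.0 / (1.0 + ratio_sum)


def overfit_ratio_sum(p, s_n, t_n, q):
    """Sum of the assumed bounds q**k over all models that contain the true
    model and carry k = 1, ..., t_n - s_n extra covariates."""
    return sum(math.comb(p - s_n, k) * q ** k for k in range(1, t_n - s_n + 1))


p, s_n, t_n, q = 1000, 5, 25, 1e-5
total = overfit_ratio_sum(p, s_n, t_n, q)
# binom(N, k) <= N**k / k! gives the bound exp(N * q) - 1 with N = p - s_n.
upper = math.expm1((p - s_n) * q)
print(total <= upper, posterior_of_true_model(total))
```

The sum is small exactly when $(p - s_n)q$ is small, which mirrors the requirement that $p^{2\alpha_0+2}/(1+c_0^{-1}n\underline{\phi}_n)$ tend to zero in the display above.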



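Proof of Proposition 3.3

(3.2) follows immediately from Proposition 2 in E. Next we verify (3.3). Fix $2 < \alpha' < \alpha/2$. If $\xi \sim \chi^2_\mu$,
then by Chebyshev's inequality,

$$\begin{aligned}
P(\xi \ge \alpha\mu a_n) &= P\big(\exp(\xi/\alpha') \ge \exp((\alpha/\alpha')\mu a_n)\big) \\
&\le \exp(-(\alpha/\alpha')\mu a_n)\,E\{\exp(\xi/\alpha')\} \\
&= (1 - 2/\alpha')^{-\mu/2}\exp(-(\alpha/\alpha')\mu a_n).
\end{aligned} \qquad (6.9)$$

Clearly, given $P_\gamma$, $X_j^TP_\gamma X_j$ follows $\chi^2_{|\gamma|}$. Then it follows by (6.9) and the fact $\binom{p}{r} \le p^r/r!$ that

$$\begin{aligned}
& P\Big(\max_{0<t_n<s_n}\ \max_{\gamma\in T(t_n)}\ \max_{j\in\gamma^0\setminus\gamma}X_j^TP_\gamma X_j \ge \alpha s_n\log p\Big) \\
&\quad \le P\Big(\max_{\gamma\in T(s_n-1)}\ \max_{j\in\gamma^0\setminus\gamma}X_j^TP_\gamma X_j \ge \alpha s_n\log p\Big) \\
&\quad \le \sum_{\gamma\in T(s_n-1)\setminus\{\emptyset\}}\ \sum_{j\in\gamma^0\setminus\gamma}P\big(X_j^TP_\gamma X_j \ge \alpha s_n\log p\big) \\
&\quad = \sum_{\gamma\in T(s_n-1)\setminus\{\emptyset\}}\ \sum_{j\in\gamma^0\setminus\gamma}E\big\{P\big(X_j^TP_\gamma X_j \ge \alpha s_n\log p \,\big|\, P_\gamma\big)\big\} \\
&\quad \le s_n\sum_{\gamma\in T(s_n-1)\setminus\{\emptyset\}}\big[(1 - 2/\alpha')^{-1/2}\exp(-(\alpha/\alpha')\log p)\big]^{|\gamma|} \\
&\quad \le s_n\sum_{r=1}^{s_n-1}\binom{p}{r}\big[(1 - 2/\alpha')^{-1/2}p^{-\alpha/\alpha'}\big]^r \\
&\quad \le s_n\sum_{r=1}^{s_n-1}\frac{1}{r!}\big[(1 - 2/\alpha')^{-1/2}p^{1-\alpha/\alpha'}\big]^r \\
&\quad \le s_n\big[\exp\big((1 - 2/\alpha')^{-1/2}p^{1-\alpha/\alpha'}\big) - 1\big] = O(s_n/p) = o(1).
\end{aligned}$$

Thus, with probability approaching one, for any $\gamma \in T(t_n)$ with $0 < t_n < s_n$,

$$\lambda_+\big(X_{\gamma^0\setminus\gamma}^TP_\gamma X_{\gamma^0\setminus\gamma}\big) \le \mathrm{trace}\big(X_{\gamma^0\setminus\gamma}^TP_\gamma X_{\gamma^0\setminus\gamma}\big) \le s_n\max_{0<t_n<s_n}\ \max_{\gamma\in T(t_n)}\ \max_{j\in\gamma^0\setminus\gamma}X_j^TP_\gamma X_j \le \alpha s_n^2\log p.$$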
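
The exponential-moment bound (6.9) used in this proof can be compared with exact chi-square tail probabilities. The sketch below relies on the closed-form survival function of a chi-square variable with an even number of degrees of freedom, so only the standard library is needed; the function names and the parameter values are illustrative choices, not taken from the paper.

```python
import math


def chi2_sf_even(x, df):
    """Exact P(xi >= x) for xi ~ chi-square with an even number df of degrees
    of freedom: exp(-x/2) * sum_{j < df/2} (x/2)**j / j!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 0.0
    for j in range(df // 2):
        total += term            # term equals half**j / j!
        term *= half / (j + 1)
    return math.exp(-half) * total


def chernoff_bound(alpha, alpha_prime, mu, a_n):
    """Right-hand side of (6.9): (1 - 2/alpha')**(-mu/2) * exp(-(alpha/alpha') * mu * a_n)."""
    if alpha_prime <= 2:
        raise ValueError("alpha_prime must exceed 2")
    return (1.0 - 2.0 / alpha_prime) ** (-mu / 2.0) * math.exp(-(alpha / alpha_prime) * mu * a_n)


alpha, alpha_prime = 5.0, 2.4    # satisfies 2 < alpha' < alpha / 2
for mu in (2, 4, 10):
    for a_n in (0.5, 1.0, 3.0):
        exact = chi2_sf_even(alpha * mu * a_n, mu)
        print(mu, a_n, exact, chernoff_bound(alpha, alpha_prime, mu, a_n))
```

In every row the exact tail probability should sit below the bound, and both decay geometrically in $\mu a_n$, which is what allows the sums over $r$ in the proof to be controlled.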

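To prove Theorem 3.4 we need to establish the following preliminary lemma.

Lemma 2 Suppose $\epsilon \sim N(0, \sigma_0^2I_n)$. Adopt the convention that $v_\gamma^T\epsilon/\|v_\gamma\| = 0$ when $v_\gamma = 0$, and $\epsilon^TP_\gamma\epsilon/|\gamma| = 0$
when $\gamma$ is null.

(i). For $\gamma \in T_0(t_n)$, define $v_\gamma = (I_n - P_\gamma)X_{\gamma^0\setminus\gamma}\beta^0_{\gamma^0\setminus\gamma}$. Then $\max_{0<t_n<s_n}\max_{\gamma\in T_0(t_n)}\dfrac{|v_\gamma^T\epsilon|}{\|v_\gamma\|} = O_P(\sqrt{s_n})$.

(ii). For $\gamma \in T_0(t_n)$, define $v_\gamma = P_\gamma X_{\gamma^0}\beta^0_{\gamma^0}$. Then $\max_{0<t_n<s_n}\max_{\gamma\in T_0(t_n)}\dfrac{|v_\gamma^T\epsilon|}{\|v_\gamma\|} = O_P(\sqrt{s_n})$.

(iii). For $\gamma \in T_1(t_n)$, denote $\gamma^* = \gamma\cap\gamma^0$, which is nonnull. For any fixed $\alpha > 4$,

$$\lim_{n\to\infty}P\Big(\max_{0<t_n<s_n}\ \max_{\gamma\in T_1(t_n)}\frac{\epsilon^T(P_\gamma - P_{\gamma^*})\epsilon}{|\gamma| - |\gamma^*|} \le \alpha\sigma_0^2s_n\log p\Big) = 1.$$

(iv). For any fixed $\alpha > 2$,

$$\lim_{n\to\infty}P\Big(\max_{0<t_n<s_n}\ \max_{\gamma\in T_2(t_n)}\frac{\epsilon^TP_\gamma\epsilon}{|\gamma|} \le \alpha\sigma_0^2\log p\Big) = 1.$$

Proof of Lemma 2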



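The idea of the proof is similar to that of Lemma 1 and Proposition 3.3, but there are some technical
differences, so we still present some of the details. We note the trivial fact $\bigcup_{0<t_n<s_n}T_l(t_n) \subset T_l(s_n - 1)$ for
$l = 0, 1, 2$. Thus, $\#\big(\bigcup_{0<t_n<s_n}T_0(t_n)\big) \le \#T_0(s_n - 1) \le 2^{s_n}$. The proof of parts (i)-(ii) follows immediately from
(6.1).

For part (iii), fix $\alpha'$ such that $2 < \alpha' < \alpha/2$. Then the desired conclusion follows by (6.9) and the
argument below, where the last sum bounds the number of models $\gamma$ with $|\gamma| - |\gamma^*| = r$:

$$\begin{aligned}
& P\Big(\max_{0<t_n<s_n}\ \max_{\gamma\in T_1(t_n)}\frac{\epsilon^T(P_\gamma - P_{\gamma^*})\epsilon}{|\gamma| - |\gamma^*|} \ge \alpha\sigma_0^2s_n\log p\Big) \\
&\quad \le P\Big(\max_{\gamma\in T_1(s_n-1)}\frac{\epsilon^T(P_\gamma - P_{\gamma^*})\epsilon}{|\gamma| - |\gamma^*|} \ge \alpha\sigma_0^2s_n\log p\Big) \\
&\quad \le \sum_{\gamma\in T_1(s_n-1)}P\big(\epsilon^T(P_\gamma - P_{\gamma^*})\epsilon \ge \alpha\sigma_0^2(|\gamma| - |\gamma^*|)s_n\log p\big) \\
&\quad \le \sum_{r=1}^{s_n-2}\ \sum_{|\gamma| - |\gamma^*| = r}(1 - 2/\alpha')^{-r/2}p^{-(\alpha/\alpha')rs_n} \\
&\quad \le s_n^{s_n}\sum_{r=1}^{s_n-1}\frac{1}{r!}\big[(1 - 2/\alpha')^{-1/2}p^{1-(\alpha/\alpha')s_n}\big]^r \\
&\quad \le s_n^{s_n}\big[\exp\big((1 - 2/\alpha')^{-1/2}p^{1-(\alpha/\alpha')s_n}\big) - 1\big] \\
&\quad = O\big(s_n^{s_n}p^{1-(\alpha/\alpha')s_n}\big) = O\big(p^{1-(\alpha/\alpha')s_n+s_n}\big) = o(1).
\end{aligned}$$

For part (iv), fix $\alpha'$ such that $2 < \alpha' < \alpha$. Then by (6.9) with $a_n = \log p$ therein, we have

$$\begin{aligned}
& P\Big(\max_{0<t_n<s_n}\ \max_{\gamma\in T_2(t_n)}\frac{\epsilon^TP_\gamma\epsilon}{|\gamma|} \ge \alpha\sigma_0^2\log p\Big) \\
&\quad \le P\Big(\max_{\gamma\in T_2(s_n-1)}\frac{\epsilon^TP_\gamma\epsilon}{|\gamma|} \ge \alpha\sigma_0^2\log p\Big) \\
&\quad \le \sum_{\gamma\in T_2(s_n-1)}P\big(\epsilon^TP_\gamma\epsilon \ge \alpha\sigma_0^2|\gamma|\log p\big) \\
&\quad \le \sum_{r=1}^{s_n-1}\ \sum_{|\gamma|=r}(1 - 2/\alpha')^{-r/2}\exp(-(\alpha/\alpha')r\log p) \\
&\quad = \sum_{r=1}^{s_n-1}\binom{p - s_n}{r}\big[(1 - 2/\alpha')^{-1/2}p^{-\alpha/\alpha'}\big]^r \\
&\quad \le \sum_{r=1}^{s_n-1}\frac{1}{r!}\big[(1 - 2/\alpha')^{-1/2}p^{1-\alpha/\alpha'}\big]^r \\
&\quad \le \exp\big((1 - 2/\alpha')^{-1/2}p^{1-\alpha/\alpha'}\big) - 1 = o(1),
\end{aligned}$$

which shows part (iv).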

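To show part (v), fix $\alpha > \alpha' > 2$. By (6.9) with $a_n = C\log(2s_n)$ therein, we have

$$\begin{aligned}
& P\Big(\max_{0<t_n<s_n}\ \max_{\gamma\in T_0(t_n)}\frac{\epsilon^TP_\gamma\epsilon}{|\gamma|} \ge \alpha C\sigma_0^2\log(2s_n)\Big) \\
&\quad \le P\Big(\max_{\gamma\in T_0(s_n-1),\,\gamma\ne\emptyset}\frac{\epsilon^TP_\gamma\epsilon}{|\gamma|} \ge \alpha C\sigma_0^2\log(2s_n)\Big) \\
&\quad \le \sum_{\gamma\in T_0(s_n-1),\,\gamma\ne\emptyset}P\big(\epsilon^TP_\gamma\epsilon/|\gamma| \ge \alpha C\sigma_0^2\log(2s_n)\big) \\
&\quad \le \sum_{r=1}^{s_n}\ \sum_{\gamma\subset\gamma^0,\,|\gamma|=r}(1 - 2/\alpha')^{-r/2}\exp(-(\alpha/\alpha')rC\log(2s_n)) \\
&\quad = \sum_{r=1}^{s_n}\binom{s_n}{r}\big[(1 - 2/\alpha')^{-1/2}\exp(-(\alpha/\alpha')C\log(2s_n))\big]^r \\
&\quad = \big[1 + (1 - 2/\alpha')^{-1/2}(2s_n)^{-(\alpha/\alpha')C}\big]^{s_n} - 1,
\end{aligned}$$

which is small when $C > 0$ is chosen to be sufficiently large. This proves part (v).

Proof of Theorem 3.4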



To make it more readable, we sketch the idea of the proof. We will first show that for $\gamma \in T_1(t_n)$ with
$0 < t_n < s_n$ and $\gamma\cap\gamma^0 \ne \emptyset$, $\max_{\gamma\in T_1(t_n)}p(\gamma|Z)/p(\gamma\cap\gamma^0|Z)$ converges to zero in probability. Note the denominator is
bounded by $\max_{\gamma\in T_0(t_n)}p(\gamma|Z)$, and thus

$$\frac{\max_{\gamma\in T_1(t_n)}p(\gamma|Z)}{\max_{\gamma\in T_0(t_n)}p(\gamma|Z)} \to 0 \quad \text{in probability.}$$

Secondly, we show that $\max_{\gamma\in T_2(t_n)}p(\gamma|Z)/p(\emptyset|Z) \to 0$
in probability, i.e., any $\gamma \in T_2(t_n)$ is even worse than the null model. This will complete the proof. For
simplicity, all the arguments in this proof section are built upon (3.2) and (3.3), which by Assumption B.1
have overwhelming probability when $n$ is large. Next we finish these two steps.

Step I: For $\gamma \in T_1(t_n)$, define $\gamma^* = \gamma\cap\gamma^0$, which by our definition of $T_1(t_n)$ is nonnull. We will approximate
the log-ratio of $p(\gamma|Z)$ to $p(\gamma^*|Z)$, which can be decomposed as follows:

$$\begin{aligned}
-\log\frac{p(\gamma|Z)}{p(\gamma^*|Z)} = {} & -\log\frac{p(\gamma)}{p(\gamma^*)} + \frac12\log\frac{\det(W_\gamma)}{\det(W_{\gamma^*})} \\
& + \frac{n+\nu}{2}\log\frac{1+Y^T(I_n - X_\gamma U_\gamma^{-1}X_\gamma^T)Y}{1+Y^T(I_n - P_\gamma)Y} \\
& - \frac{n+\nu}{2}\log\frac{1+Y^T(I_n - X_{\gamma^*}U_{\gamma^*}^{-1}X_{\gamma^*}^T)Y}{1+Y^T(I_n - P_{\gamma^*})Y} \\
& + \frac{n+\nu}{2}\log\frac{1+Y^T(I_n - P_\gamma)Y}{1+Y^T(I_n - P_{\gamma^*})Y}.
\end{aligned}$$

Denote the above five terms by $I_1, I_2, I_3, I_4, I_5$. Clearly, $I_1$ is bounded from below and $I_3 \ge 0$. To approximate
$I_4$, we use the following Sherman-Morrison-Woodbury matrix identity (pp. 467, [42]),

$$U_{\gamma^*}^{-1} - (X_{\gamma^*}^TX_{\gamma^*})^{-1} = -(X_{\gamma^*}^TX_{\gamma^*})^{-1}\big(\Sigma_{\gamma^*} + (X_{\gamma^*}^TX_{\gamma^*})^{-1}\big)^{-1}(X_{\gamma^*}^TX_{\gamma^*})^{-1}.$$
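
The matrix identity above can be verified numerically on a small example. The sketch assumes $U_{\gamma^*} = \Sigma_{\gamma^*}^{-1} + X_{\gamma^*}^TX_{\gamma^*}$, which is how the identity and the later block-determinant step read but is not restated in this part of the text. It works with hand-written $2\times 2$ helpers so that no external library is needed; the helper names and the test matrices are hypothetical.

```python
def mat_mul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def mat_add(a, b, sign=1.0):
    """Entrywise a + sign * b for 2x2 matrices."""
    return [[a[i][j] + sign * b[i][j] for j in range(2)] for i in range(2)]


def mat_inv(a):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]


def woodbury_sides(xtx, sigma):
    """Both sides of U^{-1} - (X'X)^{-1} = -(X'X)^{-1} (Sigma + (X'X)^{-1})^{-1} (X'X)^{-1},
    under the assumption U = Sigma^{-1} + X'X."""
    xtx_inv = mat_inv(xtx)
    u = mat_add(mat_inv(sigma), xtx)
    lhs = mat_add(mat_inv(u), xtx_inv, sign=-1.0)
    middle = mat_inv(mat_add(sigma, xtx_inv))
    prod = mat_mul(mat_mul(xtx_inv, middle), xtx_inv)
    rhs = [[-prod[i][j] for j in range(2)] for i in range(2)]
    return lhs, rhs


xtx = [[4.0, 1.0], [1.0, 3.0]]      # a positive definite stand-in for X'X
sigma = [[2.0, 0.0], [0.0, 0.5]]    # a diagonal stand-in for Sigma
lhs, rhs = woodbury_sides(xtx, sigma)
print(max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2)))
```

The printed discrepancy should be at the level of floating-point rounding. The right-hand side is negative semidefinite, which is the direction of the inequality exploited when bounding $I_4$.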

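Then by $Y = X_{\gamma^0}\beta^0_{\gamma^0} + \epsilon$,

$$\begin{aligned}
\frac{1+Y^T(I_n - X_{\gamma^*}U_{\gamma^*}^{-1}X_{\gamma^*}^T)Y}{1+Y^T(I_n - P_{\gamma^*})Y}
&= 1 + \frac{Y^TX_{\gamma^*}\big((X_{\gamma^*}^TX_{\gamma^*})^{-1} - U_{\gamma^*}^{-1}\big)X_{\gamma^*}^TY}{1+Y^T(I_n - P_{\gamma^*})Y} \\
&= 1 + \frac{Y^TX_{\gamma^*}(X_{\gamma^*}^TX_{\gamma^*})^{-1}\big(\Sigma_{\gamma^*} + (X_{\gamma^*}^TX_{\gamma^*})^{-1}\big)^{-1}(X_{\gamma^*}^TX_{\gamma^*})^{-1}X_{\gamma^*}^TY}{1+Y^T(I_n - P_{\gamma^*})Y} \\
&\le 1 + \underline{\phi}_n^{-1}\,\frac{Y^TX_{\gamma^*}(X_{\gamma^*}^TX_{\gamma^*})^{-2}X_{\gamma^*}^TY}{1+Y^T(I_n - P_{\gamma^*})Y} \\
&\le 1 + 2\underline{\phi}_n^{-1}\,\frac{(\beta^0_{\gamma^0})^TX_{\gamma^0}^TX_{\gamma^*}(X_{\gamma^*}^TX_{\gamma^*})^{-2}X_{\gamma^*}^TX_{\gamma^0}\beta^0_{\gamma^0} + \epsilon^TX_{\gamma^*}(X_{\gamma^*}^TX_{\gamma^*})^{-2}X_{\gamma^*}^T\epsilon}{1+Y^T(I_n - P_{\gamma^*})Y}.
\end{aligned}$$

Without loss of generality, assume $X_{\gamma^0} = (X_{\gamma^*}, X_{\gamma^0\setminus\gamma^*})$ and $\beta^0_{\gamma^0} = ((\beta^0_{\gamma^*})^T, (\beta^0_{\gamma^0\setminus\gamma^*})^T)^T$. By a direct calculation
it can be examined that

$$X_{\gamma^0}^TX_{\gamma^*}(X_{\gamma^*}^TX_{\gamma^*})^{-2}X_{\gamma^*}^TX_{\gamma^0} =
\begin{pmatrix}
I_{|\gamma^*|} & (X_{\gamma^*}^TX_{\gamma^*})^{-1}X_{\gamma^*}^TX_{\gamma^0\setminus\gamma^*} \\
X_{\gamma^0\setminus\gamma^*}^TX_{\gamma^*}(X_{\gamma^*}^TX_{\gamma^*})^{-1} & X_{\gamma^0\setminus\gamma^*}^TX_{\gamma^*}(X_{\gamma^*}^TX_{\gamma^*})^{-2}X_{\gamma^*}^TX_{\gamma^0\setminus\gamma^*}
\end{pmatrix}.$$

By Assumption B.1, w.l.p.,

$$\lambda_+\big(X_{\gamma^0\setminus\gamma^*}^TX_{\gamma^*}(X_{\gamma^*}^TX_{\gamma^*})^{-2}X_{\gamma^*}^TX_{\gamma^0\setminus\gamma^*}\big) \le \frac{d_0\rho_n}{n},$$

which implies, w.l.p., $\lambda_+\big(X_{\gamma^0}^TX_{\gamma^*}(X_{\gamma^*}^TX_{\gamma^*})^{-2}X_{\gamma^*}^TX_{\gamma^0}\big) \le 1 + d_0\rho_n/n = O(1)$. Thus,

$$(\beta^0_{\gamma^0})^TX_{\gamma^0}^TX_{\gamma^*}(X_{\gamma^*}^TX_{\gamma^*})^{-2}X_{\gamma^*}^TX_{\gamma^0}\beta^0_{\gamma^0} \le (1 + d_0\rho_n/n)k_n. \qquad (6.10)$$

By $P_{\gamma^*} \le P_{\gamma^0}$, $E\{\epsilon^TP_{\gamma^0}\epsilon\} = \sigma_0^2s_n$ implying $\epsilon^TP_{\gamma^0}\epsilon = O_P(s_n)$, and (3.2) of Assumption B.1, we have, w.l.p.,

$$\epsilon^TX_{\gamma^*}(X_{\gamma^*}^TX_{\gamma^*})^{-2}X_{\gamma^*}^T\epsilon \le \frac{d_0}{n}\,\epsilon^TP_{\gamma^0}\epsilon = O_P(s_n/n). \qquad (6.11)$$

On the other hand, by Assumption B.3 (i),

$$Y^T(I_n - P_{\gamma^*})Y \ge Y^T(I_n - P_{\gamma^0})Y = \epsilon^T(I_n - P_{\gamma^0})\epsilon = n\sigma_0^2(1 + o_P(s_n/n)) = n\sigma_0^2(1 + o_P(1)). \qquad (6.12)$$

Combining (6.10)-(6.12), and using the fact $k_n \ge s_n\psi_n^2 \gg s_n/n$, we have for $0 < t_n < s_n$ and uniformly for
$\gamma \in T_1(t_n)$,

$$\frac{1+Y^T(I_n - X_{\gamma^*}U_{\gamma^*}^{-1}X_{\gamma^*}^T)Y}{1+Y^T(I_n - P_{\gamma^*})Y} \le 1 + 2\underline{\phi}_n^{-1}\,\frac{(1 + d_0\rho_n/n)k_n + O_P(s_n/n)}{n\sigma_0^2(1 + o_P(1))} = 1 + \frac{2(1 + d_0\rho_n/n)k_n}{n\underline{\phi}_n\sigma_0^2}(1 + o_P(1)).$$

It follows by $k_n = O(\underline{\phi}_n)$ and $\rho_n = o(n)$ (Assumption B.3 (iv) and (v)) that for $0 < t_n < s_n$, uniformly for $c_j$'s
$\in [\underline{\phi}_n, \bar{\phi}_n]$ and uniformly for $\gamma \in T_1(t_n)$, $0 \le -I_4 = O_P(1)$.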

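Next we present lower bounds for $I_5$. Assume, without loss of generality, that $X_{\gamma^0} = (X_{\gamma^*}, X_{\gamma^0\setminus\gamma^*})$ and
$\beta^0_{\gamma^0} = ((\beta^0_{\gamma^*})^T, (\beta^0_{\gamma^0\setminus\gamma^*})^T)^T$. Then it follows by $Y = X_{\gamma^0}\beta^0_{\gamma^0} + \epsilon$ and $(P_\gamma - P_{\gamma^*})X_{\gamma^*} = 0$ that

$$\begin{aligned}
Y^T(P_\gamma - P_{\gamma^*})Y &= \big((\beta^0_{\gamma^*})^TX_{\gamma^*}^T + (\beta^0_{\gamma^0\setminus\gamma^*})^TX_{\gamma^0\setminus\gamma^*}^T + \epsilon^T\big)(P_\gamma - P_{\gamma^*})\big(X_{\gamma^*}\beta^0_{\gamma^*} + X_{\gamma^0\setminus\gamma^*}\beta^0_{\gamma^0\setminus\gamma^*} + \epsilon\big) \\
&= (\beta^0_{\gamma^0\setminus\gamma^*})^TX_{\gamma^0\setminus\gamma^*}^T(P_\gamma - P_{\gamma^*})X_{\gamma^0\setminus\gamma^*}\beta^0_{\gamma^0\setminus\gamma^*} + 2(\beta^0_{\gamma^0\setminus\gamma^*})^TX_{\gamma^0\setminus\gamma^*}^T(P_\gamma - P_{\gamma^*})\epsilon + \epsilon^T(P_\gamma - P_{\gamma^*})\epsilon \\
&\le 2(\beta^0_{\gamma^0\setminus\gamma^*})^TX_{\gamma^0\setminus\gamma^*}^T(P_\gamma - P_{\gamma^*})X_{\gamma^0\setminus\gamma^*}\beta^0_{\gamma^0\setminus\gamma^*} + 2\epsilon^T(P_\gamma - P_{\gamma^*})\epsilon.
\end{aligned}$$

By (3.3) of Assumption B.1,

$$(\beta^0_{\gamma^0\setminus\gamma^*})^TX_{\gamma^0\setminus\gamma^*}^T(P_\gamma - P_{\gamma^*})X_{\gamma^0\setminus\gamma^*}\beta^0_{\gamma^0\setminus\gamma^*} \le (\beta^0_{\gamma^0\setminus\gamma^*})^TX_{\gamma^0\setminus\gamma^*}^TP_\gamma X_{\gamma^0\setminus\gamma^*}\beta^0_{\gamma^0\setminus\gamma^*} \le \rho_n\|\beta^0_{\gamma^0\setminus\gamma^*}\|^2.$$

By Lemma 2 (iii), w.l.p., $\epsilon^T(P_\gamma - P_{\gamma^*})\epsilon \le \alpha\sigma_0^2s_n(|\gamma| - |\gamma^*|)\log p \le \alpha\sigma_0^2s_n^2\log p$, where $\alpha > 4$ is prefixed.
Therefore, w.l.p., for any $0 < t_n < s_n$ and $\gamma \in T_1(t_n)$,

$$Y^T(P_\gamma - P_{\gamma^*})Y \le 2\big(\rho_n\|\beta^0_{\gamma^0\setminus\gamma^*}\|^2 + \alpha\sigma_0^2s_n^2\log p\big) = 2\max\{\rho_n, s_n^2\log p\}\big(\|\beta^0_{\gamma^0\setminus\gamma^*}\|^2 + O(1)\big). \qquad (6.13)$$

We approximate the term $Y^T(I_n - P_{\gamma^*})Y$. Denote $v_{\gamma^*} = (I_n - P_{\gamma^*})X_{\gamma^0\setminus\gamma^*}\beta^0_{\gamma^0\setminus\gamma^*}$. A direct examination verifies
that $(I_n - P_{\gamma^*})X_{\gamma^0} = (0, (I_n - P_{\gamma^*})X_{\gamma^0\setminus\gamma^*})$, which leads to

$$\begin{aligned}
Y^T(I_n - P_{\gamma^*})Y &= (\beta^0_{\gamma^0\setminus\gamma^*})^TX_{\gamma^0\setminus\gamma^*}^T(I_n - P_{\gamma^*})X_{\gamma^0\setminus\gamma^*}\beta^0_{\gamma^0\setminus\gamma^*} + 2(\beta^0_{\gamma^0\setminus\gamma^*})^TX_{\gamma^0\setminus\gamma^*}^T(I_n - P_{\gamma^*})\epsilon + \epsilon^T(I_n - P_{\gamma^*})\epsilon \\
&= \|v_{\gamma^*}\|^2 + 2v_{\gamma^*}^T\epsilon + \epsilon^T(I_n - P_{\gamma^*})\epsilon \\
&\ge \|v_{\gamma^*}\|^2\Big(1 - \frac{2}{\|v_{\gamma^*}\|}\cdot\frac{|v_{\gamma^*}^T\epsilon|}{\|v_{\gamma^*}\|}\Big) + \epsilon^T(I_n - P_{\gamma^0})\epsilon.
\end{aligned} \qquad (6.14)$$

By Lemma 2 (i), uniformly for $\gamma^*$, $|v_{\gamma^*}^T\epsilon|/\|v_{\gamma^*}\| = O_P(\sqrt{s_n})$. Since $\gamma^0\setminus\gamma^* \ne \emptyset$, by Assumption B.1,

$$\lambda_-\Big(\frac1n X_{\gamma^0\setminus\gamma^*}^T(I_n - P_{\gamma^*})X_{\gamma^0\setminus\gamma^*}\Big) \ge d_0^{-1} - \frac{\rho_n}{n}, \qquad (6.15)$$

which implies

$$\|v_{\gamma^*}\|^2 \ge n\Big(d_0^{-1} - \frac{\rho_n}{n}\Big)\|\beta^0_{\gamma^0\setminus\gamma^*}\|^2 \ge \Big(d_0^{-1} - \frac{\rho_n}{n}\Big)n\psi_n^2.$$

By Assumption B.3 (iii), i.e., $s_n = o(n\psi_n^2)$, we have that (6.14) is greater than

$$\|v_{\gamma^*}\|^2(1 + o_P(1)) + n\sigma_0^2(1 + o_P(1)) \ge \Big(n\Big(d_0^{-1} - \frac{\rho_n}{n}\Big)\|\beta^0_{\gamma^0\setminus\gamma^*}\|^2 + n\sigma_0^2\Big)\cdot(1 + o_P(1)). \qquad (6.16)$$

Combining (6.13)-(6.16), we obtain, w.l.p.,

$$\frac{1+Y^T(I_n - P_\gamma)Y}{1+Y^T(I_n - P_{\gamma^*})Y} = 1 - \frac{Y^T(P_\gamma - P_{\gamma^*})Y}{1+Y^T(I_n - P_{\gamma^*})Y} \ge 1 - \frac{C'\max\{\rho_n, s_n^2\log p\}}{n},$$

where $C' > 0$ is a constant unrelated to $\gamma$ and $n$. This shows that w.l.p., for any $0 < t_n < s_n$ and uniformly for
$\gamma \in T_1(t_n)$,

$$I_5 = \frac{n+\nu}{2}\log\frac{1+Y^T(I_n - P_\gamma)Y}{1+Y^T(I_n - P_{\gamma^*})Y} \ge -C''\max\{\rho_n, s_n^2\log p\},$$

where $C'' > 0$ is a constant unrelated to $\gamma$ and $n$.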

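To conclude Step I, we still need to approximate $I_2$, as follows. Since $U_{\gamma^*}$ is a submatrix of $U_\gamma$, it
follows from the determinant formula for block matrices (pp. 468, [42]), and (6.15) that

$$\begin{aligned}
\det(U_\gamma) &= \det(U_{\gamma^*})\det\big(\Sigma_{\gamma\setminus\gamma^*}^{-1} + X_{\gamma\setminus\gamma^*}^T(I_n - X_{\gamma^*}U_{\gamma^*}^{-1}X_{\gamma^*}^T)X_{\gamma\setminus\gamma^*}\big) \\
&\ge \det(U_{\gamma^*})\det\big(\Sigma_{\gamma\setminus\gamma^*}^{-1} + X_{\gamma\setminus\gamma^*}^T(I_n - P_{\gamma^*})X_{\gamma\setminus\gamma^*}\big) \\
&\ge \det(U_{\gamma^*})\det\big(\Sigma_{\gamma\setminus\gamma^*}^{-1} + (nd_0^{-1} - \rho_n)I_{|\gamma\setminus\gamma^*|}\big).
\end{aligned}$$

Therefore,

$$\begin{aligned}
\frac{\det(W_\gamma)}{\det(W_{\gamma^*})} &= \frac{\det(\Sigma_\gamma)\det(U_\gamma)}{\det(\Sigma_{\gamma^*})\det(U_{\gamma^*})} \\
&\ge \det(\Sigma_{\gamma\setminus\gamma^*})\det\big(\Sigma_{\gamma\setminus\gamma^*}^{-1} + (nd_0^{-1} - \rho_n)I_{|\gamma\setminus\gamma^*|}\big) \\
&= \det\big(I_{|\gamma\setminus\gamma^*|} + (nd_0^{-1} - \rho_n)\Sigma_{\gamma\setminus\gamma^*}\big) \\
&\ge \det\big((1 + (nd_0^{-1} - \rho_n)\underline{\phi}_n)I_{|\gamma\setminus\gamma^*|}\big) \\
&= \big(1 + (nd_0^{-1} - \rho_n)\underline{\phi}_n\big)^{|\gamma\setminus\gamma^*|} \ge 1 + (nd_0^{-1} - \rho_n)\underline{\phi}_n,
\end{aligned} \qquad (6.17)$$

which shows that $I_2 \ge 2^{-1}\log(1 + (nd_0^{-1} - \rho_n)\underline{\phi}_n)$. By Assumption B.3 (v) we have, w.l.p., for some constant
$\tilde C > 0$, for any $0 < t_n < s_n$, uniformly for $c_j$'s $\in [\underline{\phi}_n, \bar{\phi}_n]$ and uniformly for $\gamma \in T_1(t_n)$,

$$\frac{p(\gamma|Z)}{\max_{\gamma\in T_0(t_n)}p(\gamma|Z)} \le \frac{p(\gamma|Z)}{p(\gamma^*|Z)} \le \tilde C\cdot\exp\Big(-2^{-1}\log\big(1 + (nd_0^{-1} - \rho_n)\underline{\phi}_n\big) + C''\max\{\rho_n, s_n^2\log p\}\Big) = o(1). \qquad (6.18)$$

This proves

$$\frac{\max_{\gamma\in T_1(t_n)}p(\gamma|Z)}{\max_{\gamma\in T_0(t_n)}p(\gamma|Z)} = o_P(1).$$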

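Step II: To accomplish the second step, we consider the following decomposition for any $0 < t_n < s_n$ and
$\gamma \in T_2(t_n)$:

$$\begin{aligned}
-\log\frac{p(\gamma|Z)}{p(\emptyset|Z)} = {} & -\log\frac{p(\gamma)}{p(\emptyset)} + \frac12\log\frac{\det(W_\gamma)}{\det(W_\emptyset)} \\
& + \frac{n+\nu}{2}\log\frac{1+Y^T(I_n - X_\gamma U_\gamma^{-1}X_\gamma^T)Y}{1+Y^T(I_n - P_\gamma)Y} + \frac{n+\nu}{2}\log\frac{1+Y^T(I_n - P_\gamma)Y}{1+Y^TY}.
\end{aligned}$$

Denote the above four terms by $I_1, I_2, I_3, I_4$. Similar to the arguments in Step I, $I_1$ is bounded from
below and $I_3 \ge 0$. So we only approximate $I_2$ and $I_4$. First we approximate $I_4$. By (3.3) of Assumption B.1,
$X_{\gamma^0}^TP_\gamma X_{\gamma^0} \le \rho_nI_{s_n}$. Let $v_\gamma = P_\gamma X_{\gamma^0}\beta^0_{\gamma^0}$; immediately we have $\|v_\gamma\|^2 \le \rho_n\|\beta^0_{\gamma^0}\|^2 = \rho_nk_n$. By Lemma 2 (iv),
we have w.l.p., $\epsilon^TP_\gamma\epsilon \le \alpha\sigma_0^2t_n\log p$, where $\alpha > 2$ is prefixed. Therefore, w.l.p., for any $0 < t_n < s_n$ and
uniformly for $\gamma \in T_2(t_n)$,

$$\begin{aligned}
Y^TP_\gamma Y &= (\beta^0_{\gamma^0})^TX_{\gamma^0}^TP_\gamma X_{\gamma^0}\beta^0_{\gamma^0} + 2(\beta^0_{\gamma^0})^TX_{\gamma^0}^TP_\gamma\epsilon + \epsilon^TP_\gamma\epsilon \\
&= \|v_\gamma\|^2 + 2v_\gamma^T\epsilon + \epsilon^TP_\gamma\epsilon \\
&\le 2\|v_\gamma\|^2 + 2\epsilon^TP_\gamma\epsilon \\
&\le 2\rho_nk_n + 2\alpha\sigma_0^2t_n\log p.
\end{aligned}$$

On the other hand, from $E\{|(X_{\gamma^0}\beta^0_{\gamma^0})^T\epsilon|^2/\|X_{\gamma^0}\beta^0_{\gamma^0}\|^2\} = \sigma_0^2$ we have $|(X_{\gamma^0}\beta^0_{\gamma^0})^T\epsilon|/\|X_{\gamma^0}\beta^0_{\gamma^0}\| = O_P(1)$. By
(3.2) of Assumption B.1, $\|X_{\gamma^0}\beta^0_{\gamma^0}\|^2 \ge nd_0^{-1}k_n$. Therefore, we have

$$\begin{aligned}
Y^TY &= \|X_{\gamma^0}\beta^0_{\gamma^0}\|^2 + 2(X_{\gamma^0}\beta^0_{\gamma^0})^T\epsilon + \epsilon^T\epsilon \\
&= \|X_{\gamma^0}\beta^0_{\gamma^0}\|^2\Big(1 + O_P\Big(\frac{1}{\|X_{\gamma^0}\beta^0_{\gamma^0}\|}\Big)\Big) + \epsilon^T\epsilon \\
&= \|X_{\gamma^0}\beta^0_{\gamma^0}\|^2(1 + o_P(1)) + n\sigma_0^2(1 + o_P(1)) \\
&\ge (d_0^{-1}nk_n + n\sigma_0^2)\cdot(1 + o_P(1)).
\end{aligned}$$

Then by $t_n\log p \le s_n\log p$, for any $0 < t_n < s_n$ and uniformly for $\gamma \in T_2(t_n)$,

$$\frac{1+Y^T(I_n - P_\gamma)Y}{1+Y^TY} = 1 - \frac{Y^TP_\gamma Y}{1+Y^TY} \ge 1 - \frac{2(\rho_nk_n + \alpha\sigma_0^2t_n\log p)}{n(d_0^{-1}k_n + \sigma_0^2)}\cdot(1 + o_P(1)) \ge 1 - \frac{C'\max\{\rho_n, s_n^2\log p\}}{n},$$

so that

$$I_4 = \frac{n+\nu}{2}\log\frac{1+Y^T(I_n - P_\gamma)Y}{1+Y^TY} \ge -C''\max\{\rho_n, s_n^2\log p\},$$

where $C'' > 0$ is unrelated to $\gamma$ and $n$.

Finally we approximate $I_2$ for $\gamma \in T_2(t_n)$. Since $|\gamma| \ge 1$, we have

$$\det(W_\gamma) = \det\big(I_{|\gamma|} + \Sigma_\gamma^{1/2}X_\gamma^TX_\gamma\Sigma_\gamma^{1/2}\big) \ge \det\big((1 + nd_0^{-1}\underline{\phi}_n)I_{|\gamma|}\big) \ge 1 + nd_0^{-1}\underline{\phi}_n.$$

Therefore, $I_2 \ge 2^{-1}\log(1 + nd_0^{-1}\underline{\phi}_n) \gg \max\{\rho_n, s_n^2\log p\}$ (Assumption B.3 (v)). As a consequence, we
have, w.l.p., for some constant $\tilde C > 0$, for any $0 < t_n < s_n$, uniformly for $c_j$'s $\in [\underline{\phi}_n, \bar{\phi}_n]$ and uniformly for
$\gamma \in T_2(t_n)$,

$$\frac{p(\gamma|Z)}{p(\emptyset|Z)} \le \tilde C\cdot\exp\Big(-2^{-1}\log(1 + nd_0^{-1}\underline{\phi}_n) + C''\max\{\rho_n, s_n^2\log p\}\Big) = o(1). \qquad (6.19)$$

This completes Step II, and thus completes the proof of Theorem 3.4.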



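Proof of Theorem 3.5

We begin with the following decomposition

$$\begin{aligned}
-\log\frac{p(\emptyset|Z)}{p(\gamma|Z)} = {} & -\log\frac{p(\emptyset)}{p(\gamma)} + \frac12\log\frac{1}{\det(W_\gamma)} \\
& - \frac{n+\nu}{2}\log\frac{1+Y^T(I_n - X_\gamma U_\gamma^{-1}X_\gamma^T)Y}{1+Y^T(I_n - P_\gamma)Y} + \frac{n+\nu}{2}\log\frac{1+Y^TY}{1+Y^T(I_n - P_\gamma)Y}.
\end{aligned}$$

Denote the above four terms by $J_1, J_2, J_3, J_4$. Clearly, $J_1$ is bounded below. The approximation of $J_3$ is
exactly the same as the approximation of $I_4$ in Step I of the proof of Theorem 3.4. By replacing $\gamma^*$ therein
with $\gamma$, one can show by going through the same procedure that $0 \le -J_3 = O_P(1)$, uniformly for $c_j$'s
$\in [\underline{\phi}_n, \bar{\phi}_n]$. So we only need to approximate $J_2$ and $J_4$.

To approximate $J_4$, note $\dfrac{1+Y^TY}{1+Y^T(I_n - P_\gamma)Y} = 1 + \dfrac{Y^TP_\gamma Y}{1+Y^T(I_n - P_\gamma)Y}$. So we only approximate the numerator and
denominator respectively. Let $v_\gamma = P_\gamma X_{\gamma^0}\beta^0_{\gamma^0}$. Immediately we have

$$v_\gamma = P_\gamma(X_\gamma, X_{\gamma^0\setminus\gamma})\big((\beta^0_\gamma)^T, (\beta^0_{\gamma^0\setminus\gamma})^T\big)^T = X_\gamma\beta^0_\gamma + P_\gamma X_{\gamma^0\setminus\gamma}\beta^0_{\gamma^0\setminus\gamma}.$$

It follows by (3.3) of Assumption B.1 and $\|\beta^0_{\gamma^0\setminus\gamma}\|^2 \le l_0\|\beta^0_\gamma\|^2$ that

$$|(\beta^0_\gamma)^TX_\gamma^TP_\gamma X_{\gamma^0\setminus\gamma}\beta^0_{\gamma^0\setminus\gamma}| \le \|X_\gamma\beta^0_\gamma\|\cdot\|P_\gamma X_{\gamma^0\setminus\gamma}\beta^0_{\gamma^0\setminus\gamma}\| \le \|X_\gamma\beta^0_\gamma\|\cdot\sqrt{\rho_n\|\beta^0_{\gamma^0\setminus\gamma}\|^2} \le \|X_\gamma\beta^0_\gamma\|\cdot\sqrt{\rho_nl_0}\,\|\beta^0_\gamma\|.$$

It follows by (3.2) of Assumption B.1 that $\|X_\gamma\beta^0_\gamma\| \ge \sqrt{nd_0^{-1}}\,\|\beta^0_\gamma\|$. Thus, by $\rho_n = o(n)$
(Assumption B.3 (v)),

$$\frac{|(\beta^0_\gamma)^TX_\gamma^TP_\gamma X_{\gamma^0\setminus\gamma}\beta^0_{\gamma^0\setminus\gamma}|}{\|X_\gamma\beta^0_\gamma\|^2} \le \frac{\sqrt{\rho_nl_0}}{\sqrt{nd_0^{-1}}} = o(1).$$

Similarly, one can show $\dfrac{\|P_\gamma X_{\gamma^0\setminus\gamma}\beta^0_{\gamma^0\setminus\gamma}\|^2}{\|X_\gamma\beta^0_\gamma\|^2} = O\Big(\dfrac{\rho_n}{n}\Big) = o(1)$. Then

$$\|v_\gamma\|^2 = \|X_\gamma\beta^0_\gamma\|^2\Big(1 + \frac{2(\beta^0_\gamma)^TX_\gamma^TP_\gamma X_{\gamma^0\setminus\gamma}\beta^0_{\gamma^0\setminus\gamma}}{\|X_\gamma\beta^0_\gamma\|^2} + \frac{\|P_\gamma X_{\gamma^0\setminus\gamma}\beta^0_{\gamma^0\setminus\gamma}\|^2}{\|X_\gamma\beta^0_\gamma\|^2}\Big) = \|X_\gamma\beta^0_\gamma\|^2(1 + o(1)).$$

Therefore, by Assumption B.3 (iii), and Lemma 2 (ii),

$$\begin{aligned}
Y^TP_\gamma Y &= \|v_\gamma\|^2 + 2v_\gamma^T\epsilon + \epsilon^TP_\gamma\epsilon \\
&\ge \|v_\gamma\|^2\Big(1 + O_P\Big(\frac{\sqrt{s_n}}{\|v_\gamma\|}\Big)\Big) \\
&= \|X_\gamma\beta^0_\gamma\|^2(1 + o_P(1)) \\
&\ge \|X_\gamma\beta^0_\gamma\|^2/2, \quad \text{w.l.p.}
\end{aligned} \qquad (6.20)$$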


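On the other hand, if we let $v_\gamma = (I_n - P_\gamma)X_{\gamma^0\setminus\gamma}\beta^0_{\gamma^0\setminus\gamma}$, then by (3.2) of Assumption B.1,

$$\|v_\gamma\|^2 \le (\beta^0_{\gamma^0\setminus\gamma})^TX_{\gamma^0\setminus\gamma}^TX_{\gamma^0\setminus\gamma}\beta^0_{\gamma^0\setminus\gamma} \le nd_0\|\beta^0_{\gamma^0\setminus\gamma}\|^2.$$

Therefore, by Lemma 2 (i) and $\epsilon^T(I_n - P_\gamma)\epsilon = n\sigma_0^2(1 + o_P(1))$,

$$\begin{aligned}
Y^T(I_n - P_\gamma)Y &= \|v_\gamma\|^2\Big(1 + \frac{2v_\gamma^T\epsilon}{\|v_\gamma\|^2}\Big) + \epsilon^T(I_n - P_\gamma)\epsilon \\
&= \|v_\gamma\|^2(1 + o_P(1)) + n\sigma_0^2(1 + o_P(1)) \\
&\le 2(\|v_\gamma\|^2 + n\sigma_0^2) \quad \text{w.l.p.} \\
&\le 2n(d_0l_0\|\beta^0_\gamma\|^2 + \sigma_0^2).
\end{aligned} \qquad (6.21)$$

Define $\delta_0 = \sigma_0^2/(d_0l_0)$. Consequently, by (6.20) and (6.21), and $\|\beta^0_\gamma\|^2 \ge \psi_n^2$, w.l.p. (the additive constant 1 in
the denominator being absorbed by the slack in (6.21)),

$$\begin{aligned}
1 + \frac{Y^TP_\gamma Y}{1+Y^T(I_n - P_\gamma)Y} &\ge 1 + \frac{nd_0^{-1}\|\beta^0_\gamma\|^2/2}{2n(d_0l_0\|\beta^0_\gamma\|^2 + \sigma_0^2)} = 1 + \frac{\|\beta^0_\gamma\|^2}{4d_0^2l_0(\|\beta^0_\gamma\|^2 + \delta_0)} \\
&\ge 1 + \frac{\psi_n^2}{4d_0^2l_0(\psi_n^2 + \delta_0)} \\
&\ge 1 + \frac{1}{4d_0^2l_0}\min\Big\{\frac12, \frac{\psi_n^2}{2\delta_0}\Big\}.
\end{aligned}$$

Thus,

$$J_4 \ge \frac{n+\nu}{2}\log\Big(1 + \frac{1}{4d_0^2l_0}\min\Big\{\frac12, \frac{\psi_n^2}{2\delta_0}\Big\}\Big). \qquad (6.22)$$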



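Finally we approximate $J_2$. Since $\det(W_\gamma) = \det\big(I_{|\gamma|} + \Sigma_\gamma^{1/2}X_\gamma^TX_\gamma\Sigma_\gamma^{1/2}\big) \le (1 + d_0n\bar{\phi}_n)^{|\gamma|}$, we have

$$J_2 = \frac12\log\frac{1}{\det(W_\gamma)} \ge -\frac{s_n}{2}\log(1 + d_0n\bar{\phi}_n). \qquad (6.23)$$

Combining (6.22) and (6.23), there exists a constant $\tilde C$ such that, w.l.p., uniformly for $c_j$'s $\in [\underline{\phi}_n, \bar{\phi}_n]$,

$$\frac{p(\emptyset|Z)}{p(\gamma|Z)} \le \tilde C\cdot\exp\Big(\frac{s_n}{2}\log(1 + d_0n\bar{\phi}_n) - \frac{n+\nu}{2}\log\Big(1 + \frac{1}{4d_0^2l_0}\min\Big\{\frac12, \frac{\psi_n^2}{2\delta_0}\Big\}\Big)\Big),$$

which approaches zero by Assumption A.3 (iv). This completes the proof.

Proof of Theorem 3.6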



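We observe that

$$\begin{aligned}
\min_{s_n\le t_n\le r_n}p_g(\gamma^0|Z) &= \min_{s_n\le t_n\le r_n}\int_0^\infty p(\gamma^0|c, Z)g(c)\,dc \ge \min_{s_n\le t_n\le r_n}\int_{\underline{\phi}_n}^{\bar{\phi}_n}p(\gamma^0|c, Z)g(c)\,dc \\
&\ge \int_{\underline{\phi}_n}^{\bar{\phi}_n}g(c)\,dc\cdot\min_{s_n\le t_n\le r_n}\ \inf_{\underline{\phi}_n\le c\le\bar{\phi}_n}p(\gamma^0|c, Z).
\end{aligned} \qquad (6.24)$$

By Theorem 3.2, $\min_{s_n\le t_n\le r_n}\inf_{\underline{\phi}_n\le c\le\bar{\phi}_n}p(\gamma^0|c, Z) = 1 + o_P(1)$. By assumption, $\int_{\underline{\phi}_n}^{\bar{\phi}_n}g(c)\,dc = 1 + o(1)$. Thus, by
(6.24), $\min_{s_n\le t_n\le r_n}p_g(\gamma^0|Z) \ge (1 + o(1))\cdot(1 + o_P(1)) = 1 + o_P(1)$, which proves the desired result.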

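Proof of Theorem 3.7

Define

$$D_{1n} = \max_{0<t_n<s_n}\ \sup_{\underline{\phi}_n\le c\le\bar{\phi}_n}\ \max_{\gamma\in T_1(t_n)}\frac{p(\gamma|c, Z)}{p(\gamma\cap\gamma^0|c, Z)}, \quad \text{and} \quad D_{2n} = \max_{0<t_n<s_n}\ \sup_{\underline{\phi}_n\le c\le\bar{\phi}_n}\ \max_{\gamma\in T_2(t_n)}\frac{p(\gamma|c, Z)}{p(\emptyset|c, Z)}.$$

By (6.18) and (6.19) in the proof of Theorem 3.4, $D_{1n} = o_P(1)$ and $D_{2n} = o_P(1)$. For any $\gamma \in T_1(t_n)$, denote
$\gamma^* = \gamma\cap\gamma^0$. Then

$$p_g(\gamma|Z) = \int_0^\infty p(\gamma|c, Z)g(c)\,dc = \int_{\underline{\phi}_n}^{\bar{\phi}_n}p(\gamma|c, Z)g(c)\,dc \le D_{1n}\int_{\underline{\phi}_n}^{\bar{\phi}_n}p(\gamma^*|c, Z)g(c)\,dc = D_{1n}\,p_g(\gamma^*|Z) \le D_{1n}\max_{\gamma\in T_0(t_n)}p_g(\gamma|Z).$$

Therefore,

$$\max_{0<t_n<s_n}\frac{\max_{\gamma\in T_1(t_n)}p_g(\gamma|Z)}{\max_{\gamma\in T_0(t_n)}p_g(\gamma|Z)} \le D_{1n} = o_P(1). \qquad (6.25)$$

Likewise, for any $\gamma \in T_2(t_n)$,

$$p_g(\gamma|Z) = \int_{\underline{\phi}_n}^{\bar{\phi}_n}p(\gamma|c, Z)g(c)\,dc \le D_{2n}\int_{\underline{\phi}_n}^{\bar{\phi}_n}p(\emptyset|c, Z)g(c)\,dc = D_{2n}\,p_g(\emptyset|Z) \le D_{2n}\max_{\gamma\in T_0(t_n)}p_g(\gamma|Z).$$

Therefore,

$$\max_{0<t_n<s_n}\frac{\max_{\gamma\in T_2(t_n)}p_g(\gamma|Z)}{\max_{\gamma\in T_0(t_n)}p_g(\gamma|Z)} \le D_{2n} = o_P(1). \qquad (6.26)$$

The desired conclusion follows immediately from (6.25) and (6.26).

Proof of Theorem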




Theorem 3.5 implies $D_n = o_P(1)$. Then

$$p_g(\emptyset|Z) = \int_{\underline{\phi}_n}^{\bar{\phi}_n}p(\emptyset|c, Z)g(c)\,dc \le D_n\int_{\underline{\phi}_n}^{\bar{\phi}_n}p(\gamma|c, Z)g(c)\,dc = D_n\,p_g(\gamma|Z).$$

Thus, $\dfrac{p_g(\emptyset|Z)}{p_g(\gamma|Z)} \le D_n = o_P(1)$, which completes the proof.